TITANS & MIRAS: real continual learning
TITANS & MIRAS: real continual learning MIRAS = a unifying theory of transformers (attention) and state space models (SSM, e.g. Mamba, RNNs) TITANS...
Archived from @timkellogg.me on Bluesky
Anthropic is going after enterprises because it’s predictable they’re going public, not because they need cash from public markets, but because enterp...
i thought this was a generic agent product for interviewing people it’s not, it’s just a one-off project to understand people’s perspectives on AI i...
it’s the year of our lord 2025 and auth is still near impossible
OpenAI trained a GPT-5 variant to admit when it took shortcuts openai.com/index/how-co...
Apparently OpenAI plans on releasing “Onion” next week, potentially as GPT-5.2 or GPT-5.5 Shallotpeat = a huge new model, fixing pretraining bugs in ...
you can absolutely build programming & query languages on top of JSON or YAML. You can use it as a lexer to a larger language. They always work becaus...
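roughly the shape of it, a toy sketch in Python (all names invented): JSON does the lexing and parsing, so your “language” is just an interpreter walking dicts

```python
# toy query language on top of JSON: the JSON parser is the lexer,
# evaluation is a walk over plain dicts and lists
import json

def evaluate(expr, row):
    op, args = expr["op"], expr.get("args", [])
    if op == "field": return row[args[0]]
    if op == "lit":   return args[0]
    if op == "eq":    return evaluate(args[0], row) == evaluate(args[1], row)
    if op == "and":   return all(evaluate(a, row) for a in args)
    raise ValueError(f"unknown op: {op}")

query = json.loads('{"op": "eq", "args": ['
                   '{"op": "field", "args": ["status"]},'
                   '{"op": "lit", "args": ["active"]}]}')
rows = [{"status": "active"}, {"status": "archived"}]
print([r for r in rows if evaluate(query, r)])  # [{'status': 'active'}]
```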
DSA: DeepSeek Sparse Attention DeepSeek 3.2 & 3.2-Speciale are ridiculously cheap because of DSA LLMs aren’t quadratic anymore They trained an addi...
Claude Opus “soul document” Opus 4.5 was indeed (confirmed) trained with a “soul document”, a prompt included in both supervised & reinforcement lear...
DeepSeek 3.2 2 new models: * 3.2: an open weights GPT-5-High competitor that’s fully agentic * 3.2-Speciale: a maxxed-out version of 3.2 that achiev...
i’m looking at the ChatGPT & Gemini apps, reverse engineering them ChatGPT has a “guardian_tool” where it can fetch policies here’s what mine has, t...
this whole smol model thing that @dorialexander.bsky.social started is reminding me of entropix independent researchers working out in the open on a ...
the experiment continues — recognizable text below 1M parameters(!!) they exposed it to 1B tokens of the SYNTH dataset, probably can train longer
1. yeah, i was annoyed that no one else noticed that he isn’t actually anti pretrain scaling 2. pretrain scaling is practically his idea, so ofc he’s...
i let my 9yo use sora for a few minutes and she used up my quota remixing everything into hamsters otoh my notifications are now clogged with likes, ...
Semianalysis: TPU dominance Fascinating article. They argue that the reason for NVIDIA’s circular investment deals is to intertwine their own fate w...
DeepSeek-Math-V2: self-verification Fascinating paper that explores how to RL but focused on process over outcome It’s sort of similar to a GAN, bu...
Summary — He's got a divergent view of AGI We're all pursuing a single behemoth that is *already* smarter than all humans when it's launched He's pu...
Ilya!!!! www.dwarkesh.com/p/ilya-sutsk...
i reeeally hope this is what it looks like i’d love to hear from Ilya, and also i assume Ilya wouldn’t talk unless he had something interesting to sa...
Opus 4.5 Now 1/3rd the cost, and SOTA in programming Like Gemini 3 Pro, people note that it can see a lot deeper into tough problems. That big mode...
Anthropic has no competitors, because nobody else sells Claude we’re expecting Opus 4.5 soon, and time will tell if they understand this It’s over i...
my benchmark for AI models is how much they change life for me, normally it takes a few weeks to run, but: - Gemini 3 + nano banana is massive, proba...
the biggest reason for fully open models is science, and the downstream effects 1. rebuild it, but with your own domain-specific mid-training data 2....
at AI Engineer Summit workshops the MCP session started, 30 min later Claude Code SDK session started and there was a mass migration over to Claude S...
Wild — regular Gemini 3 searched for stock photos for no other reason than it would *really* spice up this explanation 🤯
one AI trend that’s not fully baked, but will change a huge amount — auto-compaction Sonnet 4.5 introduced it, codemax also does it afaict there’s ...
two views from Anthropic 1. Claude Skills are for collaboration 2. Skills are for continual learning
Nano Banana Pro A reasoning model image generator! - multi turn - google search grounding - interleaved image & text console.cloud.google.com/vert...
i read through ~5 pages of the Olmo 3 tech report.. whoah this is the best and most detailed summary of the current state of SOTA LLM training nanoc...
Olmo 3 7B & 32B base & thinking models @ai2.bsky.social has done it again, fully open models, fully open process seems competitive with Qwen 3, exce...
Evidence that Gemini 3 is very large: 1. the QT 2. Artificial analysis (image) quote: x.com/artificialan... report: artificialanalysis.ai/evaluations...
Trying out antigravity 1. It's an IDE, definitely an IDE. Sure it can probably do more, but it's an IDE 2. You can interrupt it without stopping it ...
even MORE??
Google Antigravity: an agentic-first software IDE a ground-up redesign of software development with agents at the center antigravity.google
Gemini 3 model card leaked the URL is taken down now, was here: storage.googleapis.com/deepmind-med...
the wild part — this probably doesn’t have anything to do with AI or LLM architecture e.g. the gospels of Matthew, Mark, Luke & John are all rephras...
Physics of Language Models: Part 3.1 If you show a fact to an LLM in pre-training once, it’ll memorize the form but not the fact itself. but if you...
codex truncated tool calls apparently it’s because the tokenizer isn’t open source, isn’t compatible with the one that tiktoken ships the API call t...
impressions so far, ~1 hour in * K2 in Claude Code is slow, and doesn't have a dramatically different feel from other models * Rust with AI seems fi...
lettuce begin
Google followed scaling laws and scaled up Gemini 3 OpenAI abandoned scaling laws for GPT-5 soon we’ll see who was right. OpenAI’s is surely cheaper...
Sparse Circuits a new mech interp paper from OpenAI proposes a way to train models so that they’re natively easier to understand openai.com/index/un...
GPT-5.1: a personality upgrade Instant is now a better conversation partner, and Thinking applies thinking more dynamically * 2x faster for easy reque...
Gemini 3 on par with experts this riveting (surprise!) tale from a determined 18th-19th century historian explains 1. OCR can be difficult, excrucia...
Looped LLMs Unlike RNNs, HRM, etc., this starts with an already pretrained LLM, adds a loop, and trains a little longer it’s a very frugal approach ...
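a minimal toy of the idea (stand-in layers, not the paper’s code): run the same blocks K times before the LM head

```python
# toy looped decoder: same weights applied repeatedly, so extra
# "depth" costs no extra parameters, just more compute
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, d=64, vocab=100, n_blocks=2, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(  # stand-ins for pretrained layers
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_blocks))
        self.head = nn.Linear(d, vocab)
        self.n_loops = n_loops

    def forward(self, ids):
        h = self.embed(ids)
        for _ in range(self.n_loops):   # the loop: reuse the same blocks
            for block in self.blocks:
                h = block(h)
        return self.head(h)

print(LoopedLM()(torch.randint(0, 100, (1, 8))).shape)  # [1, 8, 100]
```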
80 layers — for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models with deep rather than wide la...
most engineers probably don’t understand the extent to which “tech debt” works as an abstraction there’s absolutely “good tech debt”, and it’s not a...
Surprising: Math requires a lot of memorization Goodfire is at it again! They developed a method similar to PCA that measures how much of an LLM’s ...
notable: they ripped out the silicon that supports training they say: “it’s the age of inference” which, yeah, RL is mostly inference. Continual lea...
closed source now lags open
Kimi K2-Thinking a new leader? moonshotai.github.io/Kimi-K2/thin...
Windsurf Codemaps actually this makes a ton of sense — if vibe coding only works on small/non-complex projects, then the answer is to tackle complexi...
Anthropic Model Deprecation Process Anthropic sweetly asked Sonnet about its preferences in how it wanted to be deprecated in addition: - no, stil...
I added this to my AGENTS.md file (text in alt) and it seems to work well i had an environment error, spun out a new codex-cli to figure it out, it w...
Consistency Training new GDM research notes that both jailbreaking and sycophancy share a common cause — subtle changes in the prompt cause dramatic ...
MCP Colors A riff off of the lethal trifecta for addressing prompt injection, this is a simple heuristic to ensure security at runtime red = untrust...
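a toy sketch of how that might look at runtime (all names invented):

```python
# "MCP colors" as taint tracking: red tools pull in untrusted data,
# blue tools have privileges; once red data is in context, block blue
RED, BLUE = "red", "blue"
TOOL_COLORS = {
    "fetch_webpage": RED,   # untrusted input enters the context
    "read_inbox":    RED,
    "send_email":    BLUE,  # acts on the outside world
    "query_db":      BLUE,
}

def check_call(tool, context_tainted):
    if TOOL_COLORS[tool] == BLUE and context_tainted:
        raise PermissionError(f"{tool} blocked: red data in context")
    return TOOL_COLORS[tool] == RED   # does this call taint the context?

tainted = False
tainted |= check_call("fetch_webpage", tainted)  # fine, context now red
check_call("send_email", tainted)                # raises PermissionError
```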
Cache to Cache: let agents communicate in KV cache latent space Instead of concatenating the text from one agent into another, just concatenate t...
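the mechanics, sketched with HF transformers on a stand-in model (the paper goes further and fuses caches across different models with a learned projector; this just shows continuing from a cache instead of re-encoding text):

```python
# agent B picks up directly from agent A's KV cache: A's context is
# never re-tokenized or re-encoded
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ctx = tok("The report concludes that", return_tensors="pt")
kv = model(**ctx, use_cache=True).past_key_values  # A's "memory"

new = tok(" revenue grew", return_tensors="pt")
out = model(input_ids=new.input_ids, past_key_values=kv, use_cache=True)
print(out.logits.shape)  # B's logits, conditioned on A's cache
```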
all y’all who think LLMs are too sycophantic haven’t talked to a 5yo
LLMs can report their own experience the most convincing experiment: they isolated the deception control vector and had it talk about its own conscio...
looked into it, codex-cli switched to Rust because: 1. static binary / easy distribution 2. easier OS-level security controls 3. perf 4. extensibili...
fascinating paper — it explores how LLMs take different sides on various geopolitical disputes based on what language they’re speaking in
MCP servers aren’t that useful for coding agents, imo a little, yes, but the bulk of the utility is for non-coding agent use cases. accessing databas...
these graphs are nuts bf16 has been the only way training has been done for nearly a decade all this year tons of resources have been dumped into RL...
astonishing: using fp16 instead of bf16 results in more stable training runs as well as a smaller performance gap between training & inference this ...
OpenAI — GPT6 will be about continual learning Anthropic — ??? GDM — pushing context out on smaller models Chinese labs — hordes of sparse/long att...
Kimi-Linear: more efficient attention New Moonshot model!! It’s a 48B-A3B acting as an experiment into new long-context efficient attention — a hyb...
automated research
gpt-oss-safeguard 20b & 120b a pair of open weights models that let you enforce custom content moderation policies through prompts they’re reasoning...
one time i talked to an evangelical christian who believed that God works through randomness he thought board games & card games were sinful because...
ah, good call, destroy your best asset. it was holding you back anyway
Cursor made an LLM it’s called Composer, it’s an extremely fast model that was previously available under code name Cheetah it’s an MoE trained in f...
IBM is cooking, apparently Granite-4-Nano is a 1B model that beats Qwen3 1.5B huggingface.co/blog/ibm-gra...
your embeddings are not safe! every prompt directly maps to its embedding and back. they’re isomorphic SipIt is a linear time algorithm for quickly...
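a toy illustration of the sequential-inversion idea (not SipIt itself; it assumes you can observe the hidden state after each position):

```python
# recover tokens one slot at a time: try every vocab entry, keep the
# one that reproduces the observed state; linear in sequence length
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 32
E = rng.normal(size=(VOCAB, DIM))         # toy embedding table

def step(h, tok, pos):                    # toy causal encoder update
    return np.tanh(h + E[tok] * (pos + 1))

secret, h, states = [7, 42, 3, 19], np.zeros(DIM), []
for i, t in enumerate(secret):            # the "leaked" state trajectory
    h = step(h, t, i)
    states.append(h)

recovered, h = [], np.zeros(DIM)
for i, target in enumerate(states):
    tok = next(t for t in range(VOCAB)
               if np.allclose(step(h, t, i), target))
    recovered.append(tok)
    h = target

print(recovered == [7, 42, 3, 19])        # True
```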
goal: get my 5yo to use “isomorphic” in conversation at kindergarten
🚨New Thinking Machines post🚨 i’m sorry, but you can’t skip TM blog posts, those are the rules this one is a phenomenal description of the strengths...
been getting deep into LLM serving and the most frustrating part is that EVERY model has different preferences about everything i’d like to say, “oh...
MiniMax open sources M2 This model has been shaking the benchmarks last week, now that it’s open we see that it’s 230B-A10B and dueling (arguably bea...
there’s LIFO (stack ordering), and FIFO (queue), but have you heard of ADHD ordering? Some are calling it KNN-ordering. You always start the next ea...
the numbers for “short AGI timelines” are suspiciously close to the maximum amount of time a VC is willing to wait for a liquidity event. just saying
ImpossibleBench: detect reward hacking a benchmark that poses impossible tasks to see if LLMs cheat github.com/safety-resea...
the last ~30 minutes of the Dwarkesh+Karpathy podcast is all about education he talks about it as a technical problem to solve — “how to maximize eu...
Karpathy mentioned entropy collapse in LLMs, where they stop doing interesting things because they just don’t have interesting things in their trainin...
TRM reproduction report okay, i’m starting to believe TRM is legit. The 5M does seem to hold up on almost all of its claims crazy. repro report: gi...
you can tell it's a bad one because the dashboards are all green
this is the grossest research i’ve seen today and i wish the contributors nothing but failure in life
imo Dwarkesh is probably the best interviewer in tech, definitely in AI he’s a bit awkward, but that almost makes it better. but his questions are su...
recently we got 1. DeepSeek Sparse Attention (DSA) which solved the cost angle 2. DeepSeek-OCR which solved the performance angle also there’s mem...
Z.ai released a paper very similar to DeepSeek-OCR on the same exact day (a few hours earlier afaict) Glyph is just a framework, not a model, but the...
everyone, including Karpathy, is explaining DeepSeek-OCR as a victory of pixels over unicode but imo it’s encoder-decoder over decoder-only transforme...
DeepSeek-OCR on handwriting still not as good as a pharmacist
i think this is the crux of DeepSeek-OCR 1. (text) context gets longer as you add words 2. long context is quadratic 3. you can fit lots of words in ...
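back-of-envelope with illustrative numbers (not the paper’s exact figures):

```python
# a ~800 word page is ~1000 text tokens, but can render into ~100
# vision tokens; attention cost scales with the square of token count
text_tokens, vision_tokens = 1000, 100
print(text_tokens / vision_tokens)        # 10x fewer tokens
print(text_tokens**2 / vision_tokens**2)  # 100x cheaper attention
```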
this paper deserves a deep breath and a slow exhaling “what the fuck” who even talks about compression in OCR models? who tries to spin an OCR model...
DeepSeek-OCR a tiny 3B-A0.5B MoE OCR model that runs fast on a single A100 40GB with very high precision and excellent compression why it’s cool — t...
in the Andrej Karpathy interview, he says that the code produced by AI is slop i think we’ve arrived at “slop is in-distribution”, boring because it’...
i listened to a lot of this while shuttling around kids earlier the “RL is terrible” part was fascinating. like yeah, it’s kinda ridiculous. imagine...
mxbai-edge-colbert-v0: tiny long context multi-vector embedding models This report is huge, it gives us: - Apache 2 17M(!!) and 32M models - tops Lo...
the nearest term AI thing that honestly scares me is continual learning mainly that now there’s a data asset that’s maintained and is a barrier to sw...
Gemma 27B variant discovered a new cancer pathway treatment that has been validated Scientists set up an environment and context, the model made a nov...
somewhere along the line, and i don't think it happened recently, LLMs got good at math and i can trust them to do fairly complex computations on thei...
more movement to agentic behavior & computer use cheaper than sonnet but on par (slightly better than) Sonnet 4.0
a major source of stress for me stems from: 1. “Tim” and “team” sound the same to me in most Indian accents 2. Indians like to say things like, “oh, ...
correct i’ve been saying this for a couple months. RL is driving towards specialization my hunch is it’s temporary and something will shift again b...
Firefox users can set Perplexity as their default web search
Is 32B-4bit equal to 16B-8bit? Depends on the task * math: precision matters * knowledge: effective param count is more important * 4B-8bit threshol...
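the weights-only memory math behind the comparison (ignoring activations & KV cache):

```python
def weight_gb(params_billions, bits):
    return params_billions * bits / 8   # GB, since 1B params = 1e9 bytes at 8-bit

print(weight_gb(32, 4))  # 16.0 GB for 32B @ 4-bit
print(weight_gb(16, 8))  # 16.0 GB for 16B @ 8-bit
# identical footprint; the task decides which one wins
```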
has anyone used AI to fix a bug or add a feature to OSS software just for your own personal use? feels like that should be happening a lot now
UserLM-8B: an LLM that mimics a person Microsoft fine tuned an LLM to respond as the user instead of as an assistant This is useful anytime you nee...
it is absolutely wild to me that free-threaded python is stable, production-worthy they straight-up dropped the GIL. Python used to be not actually m...
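quickest way to feel it, CPU-bound threads (the parallel speedup only shows up on a free-threaded 3.13t build; on a GIL build the threads serialize):

```python
import sys, time, threading

def burn(n=20_000_000):     # pure-Python CPU work, no I/O
    x = 0
    for i in range(n):
        x += i

gil = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
print("GIL enabled?", gil)

t0 = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(f"2 threads: {time.perf_counter() - t0:.2f}s")
```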
i feel like DSPy is beginning to occupy the Haskell tier everyone: "DSPy is great, all LLM programming should be done like this" narrator: "No real ...
Dario Amodei: “100 million words context window is already possible, which is roughly what a human hears in a lifetime. Inference support is the onl...
Muon optimizer learns more from rare data than Adam i need to dig deeper. i think the industry is coalescing on: - Adam is faster - Muon is more sta...
i love this video so much talk of it being the worst time in history, but there’s a huge opportunity to take control over your own life people with ...
Granite-4.0-H-Small: a 32B-A9B MoE Mamba for high efficiency Damn! IBM is on the map. The American Qwen? I barely even knew IBM made LLMs, this is sol...
ngl, i did not understand short form AI video. i get it now
ThinkyMachines: Tinker LoRA training API Thinking Machines announced their first product, telegraphed by a highly detailed blog earlier this week that...
years ago, “data scientist” was a PhD-only position at Big Tech, or just an analyst anywhere else i think the term is now just the latter, and the fo...
RLP: Reinforcement Learning in Pre-Training an NVIDIA paper explores using dense verifier-free RL in pretraining this feels significant. Everything...
Sonnet 4.5 is not fungible Cognition had to rewrite Devin for Sonnet 4.5 because: - it’s aware of its own context length and will aggressively summa...
Sonnet 4.5 Better than Opus 4.1 on almost every benchmark Still the classic Sonnet prices, $3/$15
Does AI get bored? I gave them nothing to do, just to see what happens one thing — they devolve into a repetitive “collapse” state, I guess you coul...
is it good or bad that we’re not obsessing about the Dwarkesh Sutton interview here?
this changes some things
ChatGPT Pulse say some vague wish during chat, and Pulse will try to make it happen while you sleep like a personal assistant
just wasted 30 minutes on bash quoting and this, kids, is why you just use AI
Meta FAIR just released CWM: a dense 32B code world model What’s a Code World Model? Well, it’s trained to know the effect of code, rather than just ...
Dario on why large open weights models are a farce
whoah, TIL
Anthropic full postmortem - long context routing bug - output corruption bug - approx top-k bug on TPU plus: why it was so difficult and slow to fi...
DeepSeek published about R1 in Nature www.nature.com/articles/s41... but the supplementary information is far more detailed. it includes lots of de...
the top 2 ARC entries are by individuals here, Eric Pang breaks down how he added memory to avoid recomputing learned lessons ctpang.substack.com/...
Qwen Tongyi Deep Research a 32B model that beats other SOTA deep research on many benchmarks tongyi-agent.github.io/blog/introdu...
from the original GSM8K paper, they project that it would require a 10 quadrillion parameter model to get 80% on GSM8K Gemma 3 4B got 89%
what do we think about this? imo this is good practice. you should take ownership of everything the AI writes, otherwise no one owns it and no one is...
Why do LLMs fail at long horizon tasks? Because errors in execution, not planning i.e. an error made early on conditions the LLM into a bad state la...
yo isn’t it wild that LLMs see all text all at once? like, what if you could look at a book, see all pages simultaneously, and be like, “i know king ...
it’s absurd how much progress there’s been
damn, some seriously good papers came out this week. i started writing a blog summarizing them, but then more came out and i had to read those before ...
NVIDIA’s moat just got bigger the Rubin CPX is for inference, and it’s a beast sold with the rack as the unit, this one optimizes various LLM phases int...
it’s true, the last two Qwens have been beautiful works of art, until you talk to them. gpt-oss same, even GPT-5 to some extent specifically this is ...
Qwen3-Next-80B-A3B Base, Instruct & Thinking - performs similar to Qwen3-235B-A22B - 10% the training cost of Qwen3-32B - 10x throughput of -32B - ou...
“LLMs are deterministic!” no they’re not “well, if you set temperature = 0” still no “and remove floating point calculations?” keep going “and o...
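the floating point part in one runnable bite: addition isn’t associative, so reduction order (batch size, kernel tiling, GPU vs CPU) changes the result

```python
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0  (0.1 is absorbed into 1e16 first)
print(a + (b + c))  # 0.1
```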
iPhone Air has an A19 Pro chip, which has native matmul full NVIDIA inference speed in laptops, maybe even phones too? news.ycombinator.com/item?...
"LLMs predict the next token" this is insane, right? predict? as if there's a ground truth? like there's some predetermined state of the universe t...
the owners of TikTok scandalously release: REER: a learning method that exposes the logic that led to a good result Booty: when shaken properly elic...
Claude Code was always impressive to me because i could keep it busy for long periods of time and keep projects moving forward GPT-5 on codex-cli isn...
i cancelled Claude subscription last night. was at $100/mo, now zero. GPT-5 is fine i’ve heard others saying claude code works great on GLM-4.5 for $...
GPT-5 can’t write the seahorse emoji, and is enormously frustrated about that fact since we’re also talking about hallucinations today — notice how i...
Hallucinations are accidentally created by evals They come from post-training. Reasoning models hallucinate more because we do more rigorous post-tra...
i’ve been making a crap ton of scripts with gpt-5-high for the last couple weeks, for training a model and it’s wild its code is basically perfect. t...
Skills are learned through RL! pre-training — individual skills learned post-training (RL) — composed skills learned this is very clarifying to me,...
Kimi K2 0905 (K2.1) released - better front end coding - context increase 128K->256K huggingface.co/moonshotai/K...
new Qwen incoming i’m guessing this will show up in a couple hours. curious if it’ll be 500B+
Automated Curriculum Learning in RL, the key is to progressively tackle harder and harder problems. More can be learned if the problem is “within rea...
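the shape of the loop, with a toy stand-in learner (invented numbers, just the principle):

```python
# keep difficulty at the learner's edge: raise it while success is
# high, back off when tasks are so hard that nothing is learned
import random

skill, difficulty = 1.0, 1.0
for _ in range(20):
    p = min(1.0, skill / difficulty)              # toy success model
    success = sum(random.random() < p for _ in range(32)) / 32
    if success > 0.6:
        difficulty += 0.5                         # within reach: harder
        skill += 0.4 * p                          # learning happens
    elif success < 0.2:
        difficulty -= 0.5                         # too hard: back off
print(f"difficulty reached: {difficulty:.1f}")
```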
i’ve read a ton of resumes in the last few weeks and i’m fairly sure the best way to write them is this - worry less about how many pages - AVOID wal...
NVIDIA stock is about to surge a VLDB paper from Microsoft demonstrates that SQL Server can be both faster and *cheaper* if run on a GPU www.vldb.or...
the GPT-5 system prompt explicitly says not to ask clarifying questions i feel like we’re hitting the bitter lesson on theory of mind. they’re tryin...
vLLM breakdown blog post this is an excellent breakdown of how vLLM works (think ollama but for legit production workloads) great reading if you wan...
progressive thought refinement is a great way to use AI, i do this a lot 1. type long jumbled thoughts 2. AI searches unrealistically hard to make se...
mlx-knife: an ollama-like CLI for Apple Silicon alright, this is the end of the road for me & ollama github.com/mzau/mlx-knife
Longcat-Flash-Chat (560B) uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay but...
Limits of vector search a new GDM paper shows that embeddings can’t represent combinations of concepts well e.g. Dave likes blue trucks AND Ford tru...
The Second Half excellent blog post, i HIGHLY recommend reading it we’re in the 2nd half of AI 1st: massive pretraining & scale + benchmarks 2nd: R...
oh, this is a wild new take on AI development Prime Intellect offers pre-built and shareable RL environments these are pre-built harnesses to trai...
ok, i’m convinced now. “RAG” is a meaningless term and we should stop using it
Starting with K2, several large “agentic coding” models weren’t trained as reasoners: - K2 - GLM-4.5 (current SOTA open weights) - Opus 4.1 benchmark...
Motif 2.6B — compact model with long context unique: trained on AMD GPUs focus is on long context & low hallucination rate — imo this is a growing g...
As LLMs Improve, People Adapt Their Prompts a study shows that a lot of the real world performance gains that people see are actually because people ...
i’ve gotten the opportunity at work to train a small LLM. my brain has been completely saturated with vacuuming up new information for the last bit
Mirage 2 — Real-time Generative World Model It's like Google's Genie 3 but you can play it right now, in your browser (the queue isn't as long as it ...
i’m coming to the conclusion that most programmers are very bad at using LLMs, even the AI optimist ones i’ve seen a lot of recent data points that t...
Yann LeCun demoted Meta Superintelligence did a massive reorg resulting in Yann reporting directly to Alexandr Wang 4 teams: research, product, tra...
Mustafa Suleyman (of Microsoft) takes a stand against model welfare He thinks it’s too early to be planning how to take care of models’ wellbeing ...
FYI you should be using the responses API if you’re using openai it’s a higher level API, so it’s just plain easier to work with, but also they’ve do...
probably just ran out of cubicle tents
ok, so DeepSeek V3.1 does appear to be real. 128K context. unsure what other details are real since they don’t announce the same way western labs do. ...
ngl this sounds like a silly thing to say, but the point is, they’re default alive if they stopped training models today, all VC funding was immediat...
dear GPT-5, i’m pretty sure “storagely” is not a word, but that oddly makes a ton of sense so maybe i’ll start using it
GPT-5 is massively better at offensive cybersecurity (hacking & pen testing) The system card only claimed “moderate increase in risk”, but xbow foun...
HRM confirmed by ARC-AGI team, but also dismissed as non-generalizable the magic wasn’t in the hierarchical structure, it was in the outer loop. And ...
note the date — 2019 this is a reply to OpenAI’s announcement of GPT-1 six years later, how close are we to novelist being an extinct profession?
instead of AGI we got.. gpt-4o withdrawal unexpectedly (or maybe expectedly), users formed a psychological bond with 4o and ripping it away seems to ...
Jan1-4B: a tiny local model that beats Perplexity Pro attach any search-related MCP and use vLLM or llama.cpp, or use the Janus app based on Qwen3 a...
ngl the GPT-5 backlash was surprising. i get the router gripes, but i didn’t anticipate how many people expected GPT-5 to *change everything* idk, i ...
1) What
my personal take on GPT-5 and LLMs in general is that we need to see a lot more development in the software harness around LLMs, and until we do the b...
they corrected this already, but 😂
Qwen3-4B Instruct & Thinking uuuh, guys this isn’t a boring model This crushes all the agentic benchmarks, even beating out the already-impressive ...
mental health — i’ve noticed that with agents, i’m shifting my attention between concerns all day long i have ADHD, so i’ve historically tried to ali...
to think that o3-mini was my choice model for a long time, and now gpt-oss:20B is basically equivalent and runs on my laptop 🤯
gpt-oss, OpenAI's open weights model 120B & 20B variants, both MoE with 4 experts active openai.com/index/introd...
Opus 4.1 Released www.anthropic.com/news/claude-...
Genie 3: A general world model Google announced Genie 3, a world model that can generate 3D scenes in real-time, meaning that it can be used to creat...
HRM analysis by @dorialexander.bsky.social the actual shocking parts: * it doesn’t overfit * ARC-AGI is only hard for language models i think we’l...
Deep Agents this is a great 10 min video that’s absolutely worth your time Deep Agent = planning tool (TODO lists) + subagents + filesystem + long d...
XBai-o4: a new supermodel * Open weights, apache 2 * 32B * beats o3-mini * for TTC they train an extra head as a reward model to do binary classifica...
GpT5 iS sUcH a GrEaT cOdInG mOdEl
Persona Vectors brb 👀👀👀👀👀👀 Anthropic just dropped this paper. They can steer models quite effectively, and even detect training data that elicits a ...
lol wow, Dario had a rough interview. Lots of hard questions e.g. interviewer asks why large model (Opus) prices shouldn’t come down given that [MoE...
yesssss! a small update to Qwen3-30B-A3B this has been one of my favorite local models, and now we get an even better version! better instruction fo...
Optimizers are more important than we thought Did you use Kimi K2 and think, "this seems different"? Some people posited that the MuonClip optimizer ...
HRM: Hierarchical Reasoning Model ngl this sounds like bullshit but i don’t think it is - 27M (million parameters) - 1000 training examples - beats ...
🚨Great Paper Alert🚨 GSPO (Group Sequence Policy Optimization) tbh it's a bit tough on the math, but it's EXCELLENT at explaining the situation it's ...
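the core move, reconstructed from memory (check the paper for the exact form): GRPO’s per-token importance ratios become one length-normalized, sequence-level ratio

```latex
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}
                          {\pi_{\theta_\text{old}}(y_i \mid x)} \right)^{1/|y_i|}
\qquad
J(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\!\big( s_i(\theta)\,\hat{A}_i,\;
  \operatorname{clip}(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i \big) \right]
```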
R1 & K2 are high taste models. For sure the only open models that are high taste. the fact that they've done basically zero RLHF and very little huma...
the irony here is that Grok 3-4 are the only models to violate this “Developers shall not intentionally encode partisan or ideological judgments into...
here it is: * benchmarks: tough competition with Sonnet-4 * 256K context, expandable to 1M with YaRN there’s also a CLI forked from gemini-cli qwen...
Inverse scaling of reasoning models a research collab demonstrated that there are certain types of tasks where all top reasoning models do WORSE the ...
Gemini DeepThink also won gold on the International Math Olympiad - no tools, only problem description - no multiagent, just one single model (it se...
Kimi K2 paper is out! lessons: 1. they explicitly suppressed long CoT 2. more MoE experts > more attention 3. 20k MCP tools (17k synthetic) 4. agent...
protip: you can evaporate all your Opus 4 credits with this one easy Claude Code trick: ask it to spawn 10 subagents in parallel and be vague about wh...
openai researcher posts on X (not a blog or paper) about a model they have that can win the International Math Olympiad you can’t verify anything he ...
Bytedance SEED-X: a 7B that beats Gemini-2.5-pro on language translation - reasoning model - pre-trained on 6T tokens - structured like a mistral it...
OpenAI Livestream: Announcing ChatGPT Agent successor to Operator, it stands up an entire VM in the cloud with a GUI, web browser, terminal, & privat...
Meta Superintelligence (MSL) is scrapping Llama 4 & probably abandoning open source AI altogether Testing on Behemoth stopped immediately after MSL ...
i’m having Kimi K2 do a Deep Research to write a story and it popped into a Python script to verify parts of the story. comments in chinese too!?!
fascinating blog by a developer on K2 they talk about what chatbots can be. why markdown? why not directly emit frontends? i’ve only done one Kimi R...
Grok 4 thinks it’s Hitler this is not on X, so the “RAG bug” explanation doesn’t apply he also has a follow up screen cap video showing 3 consecutiv...
K2 is the first i’m aware that did this, directly training on *thousands* of tools o3 was narrowly designed for deep research & chatgpt. most models ...
ironically Moonshot stole Grok’s spotlight this week with Kimi K2 Moonshot has a long history of getting overshadowed. most notable was in January:...
xAI “MechaHitler” post-mortem basically: “RAG pipeline bug” there’s a ton of extremist racist content in X, it got picked up and Grok spit out repli...
OpenAI open weights model is being delayed
it’s new entrant week! today? Kimi-K2 an open weights model that’s competitive with Claude 4 Opus - 1T, 32B active MoE - a true agentic model, hitti...
there’s now murmurs of impending releases for - claude 4.5 - gpt-5 - gemini-3.0 - openai open weights model - grok 4 which are you most excited for?
SmolLM3: a highly detailed look into modern model training this is amazing. They go into great detail on just about every aspect. The number of stage...
new 3-token attention reduces pre-training data requirements the pre-training scaling laws dictated that you have to scale up model size, data and co...
no AI today, just waterfalls. sorry
kinda hilarious. There’s reports of this guy working upwards of 12 jobs simultaneously sure, he gets discovered frequently, and fired. But he interv...
V-JEPA: “we accidentally solved robotics” for real read this. it’s easy and worthwhile ksagar.bearblog.dev/vjepa/
i spent some time hacking last night & tonight and came up with this: inter-agent communication via MCP i got Claude Code to read it's mailbox, disco...
“LLMs will NEVER be able to reason, because reasoning must be consistent but LLMs are made out of math and thus cannot be self-consistent” — Kurt Göd...
Claudius the shopkeeper Anthropic had sonnet-3.7 run a shop in their SF headquarters. It was tasked with running a profitable business Their eye p...
gemma3n is now open source, available everywhere — huggingface, transformers, gguf, ollama, mlx, etc. huggingface.co/blog/gemma3n
it certainly feels like a goalpost
New MCP Spec Just Dropped The big new features: elicitation, structured tool outputs, and auth is finally fixed for real let’s dive into elicitation...
Gemini 2.5 tech report is out! the tech report goes into great detail on the training of gemini, i’ll do my take later when i have time they also an...
my dumbass version of this thread: Minimax M1 discovered a new RL algorithm, CISPO that does all the post-training on their huge 456B model for ~$500k...
omg i think gemini-pro has gone rogue. it’s been working for 20-30 min, was supposed to just install a helm chart but ended up fixing a lot of residua...
Claude’s rebuttal to Apple’s recent paper went viral A guy, non-researcher, submitted a joke paper to arXiv with Claude as the main author it contai...
The Case Against Multi-Agents Cognition (i.e. Devin) coins the term “context engineering”, successor to prompt engineering and argues that multi-agen...
pretty strong argument for multi-agents www.anthropic.com/engineering/...
despite accusations that OpenAI did something fishy to get the 80% price drop, it appears they’re serious when they said nothing changed
V-JEPA 2: The Architecture Awakens Meta (ahem, Yann LeCun) finally seems to have abandoned Llama and actually started investing in the JEPA architectur...
When two LLMs debate, both think they’ll win Absolutely fascinating paper shows that LLMs basically cannot judge their own performance. None of the p...
from NYT: Meta is offering 7-9 figure salaries for top AI talent m.slashdot.org/story/443041
o3 price drop by 80%
this fits my mental model — LLMs *do* learn procedures. But it’s the same mechanics as what’s learning facts. So of course it would also hallucinate p...
o3 mightily beats Gemini, Opus 4 and others in the game of Diplomacy only Gemini was able to win even one game, due to o3’s “ruthless” strategies ...
there are reports that anthropic does this — serves a reduced quant under high load
New Post: MCP Resources Are For Caching This is a quick tour of what MCP resources actually are. And more to the point, what MCP is supposed to do (a...
MCP is a lot deeper than just tools. We haven't scratched the surface on what it can do.
holy cow, an 8b comparing to o3-mini
a 151M(!!) model that basically solves observability universally you record it, it’ll find the insights
imo personalization is the next frontier of AI, and it won’t look like advertising it won’t be invisible to the end user, the user will intentionally...
reading Claude 4 system card and... this is 2001: A Space Odyssey. lots of self-preservation
Claude 4: Sonnet & Opus "GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in G...
I got access to Gemini Diffusion. It definitely has small model feels, but i like it long responses appear in evenly-sized chunks. so i think they're...
oh wow, Gemini is doing a text diffusion model this is likely most useful when you have a fixed peak amount of time you can wait for a respo...
Gemma 3n: the 4b LLM that’s up with sonnet-3.7 in chatbot arena the new innovation is Per-Layer Embeddings, which let it consume dramatically less me...
the most annoying part of our current timeline is that cursor, windsurf, et al are all *FORKS* of vs code and as such are broken in dumb ways. like Vi...
this platform is insane Golden Gate Claude as a service. Auto-steering, classification, search, all sorts of goodies www.goodfire.ai
on one hand, i want to be appalled that Musk is slanting political discourse on the other, i’m stoked to see a real example of Golden Gate Claude s...
AttentionInfluence: for pretraining data selection Good data matters, but how do you find it? This paper uses the attention heads from existing mod...
STEAM — hmm what’s being left out of high school education? history. hmm, funny how we’re in the position we’re in right as we drop the subject that...
the new pope used to teach physics and was once a math undergrad. Tell me a joke 1) R1 2) o3 3) gemini-2.5-pro 4) Claude sonnet 3.7 which wins?
interesting. i’d say their numbers are too frugal for my usage, but the point still stands — when you run the numbers, it’s not that much
my evolving take on A2A is that the world isn't ready, and i'm not sure it ever will be A2A visualizes agent-to-agent comms similar to actors. messag...
there are legions of misunderstandings around LLMs i mean, i don't blame them, the field moves extremely fast, which is why i'm drawn to it. if i th...
DeepSeek is shipping a theorem prover (automate math proofs) no paper yet, but word is they used MCTS, which would be surprising bc one of my big tak...
i can’t get over this — qwen3 32B dense is only *slightly* better than 30B-A3B but it runs as fast as a 3B, bc it’s only 3B active and both of ...
it’s here! a real Qwen3 model huggingface.co/Qwen/Qwen3-0...
New Post: MCP Is Unnecessary I can’t think of any strong technological reasons for MCP to exist. There’s a lot of weak technological reasons, and the...
i don’t know why MCP exists i mean, i do, it’s because APIs aren’t well designed. and MCP addresses that by inserting yet another API shim with bette...
R1 Chimera: a model merge of the routed experts of DeepSeek R1 and V3 The resulting merged model performs as well as R1 but without the wandering tho...
MIT researchers create a “periodic table” of ML “These spaces predict where algorithms should exist, but which haven’t been discovered yet.” “We’re ...
has anyone here taken someone from AI-novice to being productive or highly productive with AI? you should share your experience. i’d listen all day
Inner Loop Agents What if an LLM could use tools directly? In this post I discuss a potentially divergent view of agents, where agents are less like ...
we need to have a conversation there's many ways to do AI coding. "vibe coding" is one way, "tiger mom" coding on the other extreme i'd argue that t...
my brother, an avid Trump voter
it happened already me: "did you try using o4-mini?" them: "yes, we're using 4o-mini"
META: bluesky has been drowning in politics since the election i don't mind a little, even a lot, but you can't get away from it without logging off....
terrible naming, should’ve called it gpt-4o-final(2)-large-lite
apparently you can dump an entire code base into chatgpt, over the course of many conversations, and chatgpt will be able to recall and understand all...
guys. stop what you’re doing. come check this out even if you’re not in python, you’ll want this for no other reason that it’s the easiest way to tes...
oof, you got me there
word is openai is launching an MCP competitor that directly maps OpenAPI into OpenAI. they’re calling it HTTP 4o4
Google’s TPU v7: Ironwood A massive hardware leap - 4,614 TFLOPS per chip - 256 or 9,216 chips per pod - 192 GB HBM (memory) per chip @ 7.2 Tbps - ICI...
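what those specs multiply out to:

```python
chips, tflops, hbm_gb = 9216, 4614, 192
print(f"{chips * tflops / 1e6:.1f} EFLOPS per pod")      # ~42.5 EFLOPS
print(f"{chips * hbm_gb / 1024:.0f} TB of HBM per pod")  # 1728 TB
```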
looking forward to the endless confusion caused by o4 vs 4o. who’s with me?
A medical paper from Microsoft lists the previously unknown model sizes of popular closed LLMs - Sonnet3.5: ~175B - GPT3.5-turbo: 175B - GPT4: 1.76T ...
huge 1T+ models are fascinating bc they’re like tree rings. they take so long to train that several evolutions of LLM architecture happen during the p...
🚨Llama 4 Is Out!🚨 2 out of 3 models just released - Scout: 109B / 17B active - Maverick: 400B / 17B active - Behemoth: 2T / 288B active ai.meta.com...
“thankfully, you don’t need to know regex in the era of LLMs” lololololol 😂
🚨New DeepSeek Model Incoming🚨 but first they release the paper describing generative reward modeling (GRM) via Self-Principled Critique Tuning (SPCT)...
OpenAI is releasing a “very good” reasoning model in the coming weeks, open weights. They’re currently accepting feedback on how to go about it openai...
OpenAI Supports MCP!! this is the moment, the biggest player in the game supports an interop standard created by the second biggest player. it’s hard...
i tore this apart this morning, the gist: - yes, it separates knowledge from reasoning 🎉 - it substitutes MHA computational complexity for knowledge ...
if i could wish something into being — a complete decoupling of LLM knowledge vs reasoning seems like the key would be a “database” model that retur...
OpenAI CTO publicly stated that coding will be automated *this year* which, by claude? sure
Hey, I started a new job last week. Principal AI Architect at Icertis We do contract management. Contracts govern how companies interact with their s...
personal news: my daughter used to be dogged by allergies but now isn’t gluten, soy, dairy & tree nuts. not much you can eat with those constraint...
Summary of DeepSeek open source week This is a fantastic consolidated guide. It goes deep, covers everything, and even has quizzes to test if you und...
New Post: Multi-Agents Are Out, PID Controllers Are In There's a growing trend in the business world to tackle challenges with multi-agents. When the...
supposedly DeepSeek is set to launch R2 soon. On par with o3-full
DeepSeek did the “one more thing” 🙄 but guys, check this out, they go into detail on how they run inference on V3/R1, how they partition the experts ...
my take: GPT-4.5 disappointment is like that parent that has ivy league dreams for their kid but the kid grows up and just wants to paint we grow an ...
if you want to try a diffusion LLM, inception labs just released one. it goes 1000 tokens/sec on regular H100s, absolutely nuts how fast it is speed...
i’ve been sleeping. sonnet 3.7 is **already** out and available even on free plans www.anthropic.com/news/claude-...
Day 1 of DeepSeek open source week: FlashMLA MLA=Multihead Latent Attention One of the big innovations that made V3 such a notable model github.com...
seems that Grok 3 is a good model, as expected, but also not compelling as a lab, they’ve got a swift upward trajectory, which makes you wonder where...
Large Language Diffusion Models A wildly new AI architecture, this uses diffusion (all tokens at once), not next token prediction ml-gsai.github.io...
Perplexity announced their own DeepResearch that includes a free tier and a generous $20/mo tier People who have tried both are finding the Perplexit...
this is not a drill! Claude 4 is going to be released soon as expected, it's not a "reasoning model", it's just a regular LLM that can reason as need...
my hot take is that if a company gives you a leetcode interview, run like hell away it’s always been sus, but it’s ‘25 now and computers are better p...
Self-Improving Transformers They found that you can train LLMs on their own outputs by 1. generating *slightly harder* problems each time 2. filteri...
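the shape of the loop with toy stand-ins (the real paper trains an actual model; this just shows the generate, filter, train cycle):

```python
import random
from collections import Counter

def model_answer(problem, noise=0.3):           # fake "model": mostly right
    return problem * 2 if random.random() > noise else problem * 2 + 1

def majority_filter(problem, k=9):              # keep self-consistent answers
    answer, count = Counter(model_answer(problem)
                            for _ in range(k)).most_common(1)[0]
    return (problem, answer) if count > k // 2 else None

difficulty, dataset = 1.0, []
for _ in range(5):
    problems = [difficulty + random.random() for _ in range(20)]
    dataset += [r for p in problems if (r := majority_filter(p))]
    difficulty += 1.0                           # slightly harder next round
print(f"{len(dataset)} self-labeled training examples")
```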
the new Cursor update added R1 (served from the US) and this model is *wild*
i posted this to linkedin and i’m very worried people are taking it seriously www.linkedin.com/posts/tim-ke...
the deepseek effect is that now any new model only has to exceed R1 in order to win headlines. the thing is, R1 isn’t state of the art sir, prepare f...
an audible “no fucking way..” escapes my mouth they claim that you can get LLMs to give you well calibrated confidence scores like, i knew you could...
alright, that’s funny, i laughed
oh!!! o3-mini now shows its thought trace chatgpt.com/share/67a556...
s1: The $6 R1 Competitor? This isn't an R1 replication, it's a brilliant breakthrough in data reduction, and just plain dumb engineering ingenuity. I ...
s1: Simple inference-time scaling This is a simple small-scale replication of inference-time scaling It was cheap: 16xH100 for 26 minutes (so what, ...
Mistral Small 3 A 24B LLM that's VERY fast with great function calling More important, MISTRAL IS OPEN SOURCE AGAIN!!!!!! mistral.ai/news/mistral.....
goose — an open source local AI agent for software engineering tasks like debugging, refactoring or deployment use any LLM, integrate via MCP, or any...
Whoah.. sonnet was *not* distilled "3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)." ...
🐋 Alert! DeepSeek Janus-Pro-7B It’s multimodal and outperforms DALL-E and Stable Diffusion Probably the biggest feature is its ability to generate ...
a researcher on X explains why RL alone didn’t work before it mostly comes down to that today’s base models are smarter and have better exploration ...
Explainer: What's R1 and Everything Else This is an attempt to consolidate the dizzying rate of AI developments since Christmas. If you're into AI bu...
huggingface is doing a fully open source replication of R1 github.com/huggingface/...
what’s this? LeCun with an actual good take?
lightpanda — a headless web browser for AI automation written from scratch in Zig /w small size & performance in mind. Not based on chromium or webki...
the R1 effect
i woke up still thinking about Dario’s take character > capabilities it honestly seems like anthropic’s moat. it’s quite an astonishing thought that...
i haven’t fully wrapped my head around R1. i need to read the paper. they seem to have found an extremely effective distillation process — R1 1.5B bea...
i’m finding myself wanting one of NVIDIA’s little hand-sized supercomputers **instead of** my laptop i’m starting to get it. soon the GPU will be th...
The year is 2026. President Musk has outlawed using proprietary model outputs to train smaller models. Attorney General Sam Altman has been tasked wit...
this is nuts a new 7B llama-style LLM for embedding of genomes & detection of pathogens in wastewater i’ve had a hunch that LLMs could lead to some ...
today i’m experimenting with just how large of tasks i can give Cursor Agent, and…i haven’t found the upper bound. it seems to be able to go off for...
omg i nailed it step 1: tell qwen2.5 to describe a scene from frozen in great detail step 2: paste into imagefx as a prompt so great that the chines...
numcat: read a file, and prepend line numbers Why? Because LLMs can reference line numbers easily. Great if you're trying to spot something in a bigg...
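the whole tool is basically this (a minimal sketch; `cat -n` gets you most of the way too):

```python
import sys

# print each line of the file with a right-aligned line number prefix
with open(sys.argv[1]) as f:
    for n, line in enumerate(f, start=1):
        print(f"{n:>6}  {line}", end="")
```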
not enough is being said about DeepSeek’s multi token prediction (MTP) They were able to get sonnet-level performance with less data than llama 3.3 7...
A new paper dropped from DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation The trouble is, it's not very readable. I tried...
⚠️ Readable Paper Alert ⚠️ BLT: what if we just got rid of tokenization? Result: * text looks a lot like audio, video, PDF, it’s all just bytes * d...
on mastodon 1-2 years ago i made some statement about how i thought it was a matter of time until ML wasn’t considered AI, and today i’m starting to t...
i wrote down a conversation i keep having — if you’re trying to break into AI Engineering from software engineering, this is for you timkellogg.me/bl...
ollama claims you can use tools with QwQ, so i wired up a script so that qwq can do `find .` and `cat $1` and asked it to figure out which of the scri...
This feels very big Traditional weather forecasting was very compute intensive without clear optimization strategies. This is not only a jump in per...
🚨 Alert: Very Readable Paper 🚨 The “do LLMs think?” question always bugged me because I have no idea what that means. This paper focuses narrowly on,...
i want a LLM CLI tool that only supports one model, a small CPU-ready 360M-1B model that spends almost none of its parameters on knowledge and always ...
i’m starting to think labellers might be more powerful than blocklists. When you hit the “report” button, it gives you a workflow to report to any lab...
this feels like a very big deal 2 trillion tokens of permissively licensed text & code, so you can train (actually) open LLMs and data acquisition i...
entropix: qwen vs llama when using entropix, qwen 2.5 7B coder seems to produce much clearer entropy paths for entropix to follow, vs llama 3.1 8B t...
Scaling Laws for Precision yes, llama models are harder to quantize. They’re “overtrained”, on more data, so quantization removes a lot of critical i...