TITANS & MIRAS: real continual learning
TITANS & MIRAS: real continual learning MIRAS = a unifying theory of transformers (attention) and state space models (SSM, e.g. Mamba, RNNs) TITANS...
Archived from @timkellogg.me on Bluesky
Anthropic is going after enterprises because it’s predictable they’re going public, not because they need cash from public markets, but because enterp...
i thought this was a generic agent product for interviewing people it’s not, it’s just a one-off project to understand people’s perspectives on AI i...
it’s the year of our lord 2025 and auth is still near impossible
OpenAI trained a GPT-5 variant to admit when it took shortcuts openai.com/index/how-co...
Apparently OpenAI plans on releasing “Onion” next week, potentially as GPT-5.2 or GPT-5.5 Shallotpeat = a huge new model, fixing pretraining bugs in ...
you can absolutely build programming & query languages on top of JSON or YAML. You can use it as a lexer to a larger language. They always work becaus...
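roughly the shape of it, a toy sketch in Python (all names invented): JSON does the lexing and parsing, so your “language” is just an interpreter walking dicts

```python
# toy query language on top of JSON: the JSON parser is the lexer,
# evaluation is a walk over plain dicts and lists
import json

def evaluate(expr, row):
    op, args = expr["op"], expr.get("args", [])
    if op == "field": return row[args[0]]
    if op == "lit":   return args[0]
    if op == "eq":    return evaluate(args[0], row) == evaluate(args[1], row)
    if op == "and":   return all(evaluate(a, row) for a in args)
    raise ValueError(f"unknown op: {op}")

query = json.loads('{"op": "eq", "args": ['
                   '{"op": "field", "args": ["status"]},'
                   '{"op": "lit", "args": ["active"]}]}')
rows = [{"status": "active"}, {"status": "archived"}]
print([r for r in rows if evaluate(query, r)])  # [{'status': 'active'}]
```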
DSA: DeepSeek Sparse Attention DeepSeek 3.2 & 3.2-Speciale are ridiculously cheap because of DSA LLMs aren’t quadratic anymore They trained an addi...
Claude Opus “soul document” Opus 4.5 was indeed (confirmed) trained with a “soul document”, a prompt included in both supervised & reinforcement lear...
DeepSeek 3.2 2 new models: * 3.2: an open weights GPT-5-High competitor that’s fully agentic * 3.2-Speciale: a maxxed-out version of 3.2 that achiev...
i’m looking at the ChatGPT & Gemini apps, reverse engineering them ChatGPT has a “guardian_tool” where it can fetch policies here’s what mine has, t...
this whole smol model thing that @dorialexander.bsky.social started is reminding me of entropix independent researchers working out in the open on a ...
the experiment continues — recognizable text below 1M parameters(!!) they exposed it to 1B tokens of the SYNTH dataset, probably can train longer
1. yeah, i was annoyed that no one else noticed that he isn’t actually anti pretrain scaling 2. pretrain scaling is practically his idea, so ofc he’s...
i let my 9yo use sora for a few minutes and she used up my quota remixing everything into hamsters otoh my notifications are now clogged with likes, ...
Semianalysis: TPU dominance Fascinating article. They argue that the reason for NVIDIA’s circular investment deals is to intertwine their own fate w...
DeepSeek-Math-V2: self-verification Fascinating paper that explores how to RL but focused on process over outcome It’s sort of similar to a GAN, bu...
Summary — He's got a divergent view of AGI We're all pursuing a single behemoth that is *already* smarter than all humans when it's launched He's pu...
Ilya!!!! www.dwarkesh.com/p/ilya-sutsk...
i reeeally hope this is what it looks like i’d love to hear from Ilya, and also i assume Ilya wouldn’t talk unless he had something interesting to sa...
Opus 4.5 Now 1/3rd the cost, and SOTA in programming Like Gemini 3 Pro, people note that it can see a lot deeper into tough problems. That big mode...
Anthropic has no competitors, because nobody else sells Claude we’re expecting Opus 4.5 soon, and time will tell if they understand this It’s over i...
my benchmark for AI models is how much they change life for me, normally it takes a few weeks to run, but: - Gemini 3 + nano banana is massive, proba...
the biggest reason for fully open models is science, and the downstream effects 1. rebuild it, but with your own domain-specific mid-training data 2....
at AI Engineer Summit workshops the MCP session started, 30 min later Claude Code SDK session started and there was a mass migration over to Claude S...
Wild — regular Gemini 3 searched for stock photos for no other reason than it would *really* spice up this explanation 🤯
one AI trend that’s not fully baked, but will change a huge amount — auto-compaction Sonnet 4.5 introduced it, codemax also does it afaict there’s ...
two views from Anthropic 1. Claude Skills are for collaboration 2. Skills are for continual learning
Nano Banana Pro A reasoning model image generator! - multi turn - google search grounding - interleaved image & text console.cloud.google.com/vert...
i read through ~5 pages of the Olmo 3 tech report.. whoah this is the best and most detailed summary of the current state of SOTA LLM training nanoc...
Olmo 3 7B & 32B base & thinking models @ai2.bsky.social has done it again, fully open models, fully open process seems competitive with Qwen 3, exce...
Evidence that Gemini 3 is very large: 1. the QT 2. Artificial analysis (image) quote: x.com/artificialan... report: artificialanalysis.ai/evaluations...
Trying out antigravity 1. It's an IDE, definitely an IDE. Sure it can probably do more, but it's an IDE 2. You can interrupt it without stopping it ...
even MORE??
Google Antigravity: an agentic-first software IDE a ground-up redesign of software development with agents at the center antigravity.google
Gemini 3 model card leaked the URL is taken down now, was here: storage.googleapis.com/deepmind-med...
the wild part — this probably doesn’t have anything to do with AI or LLM architecture e.g. the gospels of Matthew, Mark, Luke & John are all rephras...
Physics of Language Models: Part 3.1 If you show a fact to an LLM in pre-training once, it’ll memorize the form but not the fact itself. but if you...
codex truncated tool calls apparently it’s because the tokenizer isn’t open source, isn’t compatible with the one that tiktoken ships the API call t...
impressions so far, ~1 hour in * K2 in Claude Code is slow, and doesn't have a dramatically different feel from other models * Rust with AI seems fi...
lettuce begin
Google followed scaling laws and scaled up Gemini 3 OpenAI abandoned scaling laws for GPT-5 soon we’ll see who was right. OpenAI’s is surely cheaper...
Sparse Circuits a new mech interp paper from OpenAI proposes a way to train models so that they’re natively easier to understand openai.com/index/un...
GPT-5.1: a personality upgrade Instant is now a better conversation partner, and Thinking applies thinking more dynamically * 2x faster for easy reque...
Gemini 3 on par with experts this riveting (surprise!) tale from a determined 18th-19th century historian explains 1. OCR can be difficult, excrucia...
Looped LLMs Unlike RNNs, HRM, etc., this starts with an already pretrained LLM, adds a loop, and trains a little longer it’s a very frugal approach ...
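a minimal toy of the idea (stand-in layers, not the paper’s code): run the same blocks K times before the LM head

```python
# toy looped decoder: same weights applied repeatedly, so extra
# "depth" costs no extra parameters, just more compute
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, d=64, vocab=100, n_blocks=2, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(  # stand-ins for pretrained layers
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_blocks))
        self.head = nn.Linear(d, vocab)
        self.n_loops = n_loops

    def forward(self, ids):
        h = self.embed(ids)
        for _ in range(self.n_loops):   # the loop: reuse the same blocks
            for block in self.blocks:
                h = block(h)
        return self.head(h)

print(LoopedLM()(torch.randint(0, 100, (1, 8))).shape)  # [1, 8, 100]
```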
80 layers — for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models with deep rather than wide la...
most engineers probably don’t understand the extent to which “tech debt” works as an abstraction there’s absolutely “good tech debt”, and it’s not a...
Surprising: Math requires a lot of memorization Goodfire is at it again! They developed a method similar to PCA that measures how much of an LLM’s ...
notable: they ripped out the silicon that supports training they say: “it’s the age of inference” which, yeah, RL is mostly inference. Continual lea...
closed source now lags open
Kimi K2-Thinking a new leader? moonshotai.github.io/Kimi-K2/thin...
Windsurf Codemaps actually this makes a ton of sense — if vibe coding only works on small/non-complex projects, then the answer is to tackle complexi...
Anthropic Model Deprecation Process Anthropic sweetly asked Sonnet about its preferences in how it wanted to be deprecated in addition: - no, stil...
I added this to my AGENTS.md file (text in alt) and it seems to work well i had an environment error, spun out a new codex-cli to figure it out, it w...
Consistency Training new GDM research notes that both jailbreaking and sycophancy share a common cause — subtle changes in the prompt cause dramatic ...
MCP Colors A riff off of the lethal trifecta for addressing prompt injection, this is a simple heuristic to ensure security at runtime red = untrust...
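a toy sketch of how that might look at runtime (all names invented):

```python
# "MCP colors" as taint tracking: red tools pull in untrusted data,
# blue tools have privileges; once red data is in context, block blue
RED, BLUE = "red", "blue"
TOOL_COLORS = {
    "fetch_webpage": RED,   # untrusted input enters the context
    "read_inbox":    RED,
    "send_email":    BLUE,  # acts on the outside world
    "query_db":      BLUE,
}

def check_call(tool, context_tainted):
    if TOOL_COLORS[tool] == BLUE and context_tainted:
        raise PermissionError(f"{tool} blocked: red data in context")
    return TOOL_COLORS[tool] == RED   # does this call taint the context?

tainted = False
tainted |= check_call("fetch_webpage", tainted)  # fine, context now red
check_call("send_email", tainted)                # raises PermissionError
```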
Cache to Cache: let agents communicate in KV cache latent space Instead of concatenating the text from one agent into another, just concatenate t...
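the mechanics, sketched with HF transformers on a stand-in model (the paper goes further and fuses caches across different models with a learned projector; this just shows continuing from a cache instead of re-encoding text):

```python
# agent B picks up directly from agent A's KV cache: A's context is
# never re-tokenized or re-encoded
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ctx = tok("The report concludes that", return_tensors="pt")
kv = model(**ctx, use_cache=True).past_key_values  # A's "memory"

new = tok(" revenue grew", return_tensors="pt")
out = model(input_ids=new.input_ids, past_key_values=kv, use_cache=True)
print(out.logits.shape)  # B's logits, conditioned on A's cache
```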
all y’all who think LLMs are too sycophantic haven’t talked to a 5yo
LLMs can report their own experience the most convincing experiment: they isolated the deception control vector and had it talk about its own conscio...
looked into it, codex-cli switched to Rust because: 1. static binary / easy distribution 2. easier OS-level security controls 3. perf 4. extensibili...
fascinating paper — it explores how LLMs take different sides on various geopolitical disputes based on what language they’re speaking in
MCP servers aren’t that useful for coding agents, imo a little, yes, but the bulk of the utility is for non-coding agent use cases. accessing databas...
these graphs are nuts bf16 has been the only way training has been done for nearly a decade all this year tons of resources have been dumped into RL...
astonishing: using fp16 instead of bf16 results in more stable training runs as well as a smaller performance gap between training & inference this ...
OpenAI — GPT6 will be about continual learning Anthropic — ??? GDM — pushing context out on smaller models Chinese labs — hordes of sparse/long att...
Kimi-Linear: more efficient attention New Moonshot model!! It’s a 48B-A3B acting as an experiment into new long-context efficient attention — a hyb...
automated research
gpt-oss-safeguard 20b & 120b a pair of open weights models that let you enforce custom content moderation policies through prompts they’re reasoning...
one time i talked to an evangelical christian who believed that God works through randomness he thought board games & card games were sinful because...
ah, good call, destroy your best asset. it was holding you back anyway
Cursor made an LLM it’s called Composer, it’s an extremely fast model that was previously available under code name Cheetah it’s an MoE trained in f...
IBM is cooking, apparently Granite-4-Nano is a 1B model that beats Qwen3 1.5B huggingface.co/blog/ibm-gra...
your embeddings are not safe! every prompt directly maps to its embedding and back. they’re isomorphic SipIt is a linear time algorithm for quickly...
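a toy illustration of the sequential-inversion idea (not SipIt itself; it assumes you can observe the hidden state after each position):

```python
# recover tokens one slot at a time: try every vocab entry, keep the
# one that reproduces the observed state; linear in sequence length
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 32
E = rng.normal(size=(VOCAB, DIM))         # toy embedding table

def step(h, tok, pos):                    # toy causal encoder update
    return np.tanh(h + E[tok] * (pos + 1))

secret, h, states = [7, 42, 3, 19], np.zeros(DIM), []
for i, t in enumerate(secret):            # the "leaked" state trajectory
    h = step(h, t, i)
    states.append(h)

recovered, h = [], np.zeros(DIM)
for i, target in enumerate(states):
    tok = next(t for t in range(VOCAB)
               if np.allclose(step(h, t, i), target))
    recovered.append(tok)
    h = target

print(recovered == [7, 42, 3, 19])        # True
```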
goal: get my 5yo to use “isomorphic” in conversation at kindergarten
🚨New Thinking Machines post🚨 i’m sorry, but you can’t skip TM blog posts, those are the rules this one is a phenomenal description of the strengths...
been getting deep into LLM serving and the most frustrating part is that EVERY model has different preferences about everything i’d like to say, “oh...
MiniMax open sources M2 This model has been shaking the benchmarks last week, now that it’s open we see that it’s 230B-A10B and dueling (arguably bea...
there’s LIFO (stack ordering), and FIFO (queue), but have you heard of ADHD ordering? Some are calling it KNN-ordering. You always start the next ea...
the numbers for “short AGI timelines” are suspiciously close to the maximum amount of time a VC is willing to wait for a liquidity event. just saying
ImpossibleBench: detect reward hacking a benchmark that poses impossible tasks to see if LLMs cheat github.com/safety-resea...
the last ~30 minutes of the Dwarkesh+Karpathy podcast is all about education he talks about it as a technical problem to solve — “how to maximize eu...
Karpathy mentioned entropy collapse in LLMs, where they stop doing interesting things because they just don’t have interesting things in their trainin...
TRM reproduction report okay, i’m starting to believe TRM is legit. The 5M does seem to hold up on almost all of its claims crazy. repro report: gi...
you can tell it's a bad one because the dashboards are all green
this is the grossest research i’ve seen today and i wish the contributors nothing but failure in life
imo Dwarkesh is probably the best interviewer in tech, definitely in AI he’s a bit awkward, but that almost makes it better. but his questions are su...
recently we got 1. DeepSeek Sparse Attention (DSA) which solved the cost angle 2. DeepSeek-OCR which solved the performance angle also there’s mem...
Z.ai released a paper very similar to DeepSeek-OCR on the same exact day (a few hours earlier afaict) Glyph is just a framework, not a model, but the...
everyone, including Karpathy, is explaining DeepSeek-OCR as a victory of pixels over unicode but imo it’s encoder-decoder over decoder-only transforme...
DeepSeek-OCR on handwriting still not as good as a pharmacist
i think this is the crux of DeepSeek-OCR 1. (text) context gets longer as you add words 2. long context is quadratic 3. you can fit lots of words in ...
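back-of-envelope with illustrative numbers (not the paper’s exact figures):

```python
# a ~800 word page is ~1000 text tokens, but can render into ~100
# vision tokens; attention cost scales with the square of token count
text_tokens, vision_tokens = 1000, 100
print(text_tokens / vision_tokens)        # 10x fewer tokens
print(text_tokens**2 / vision_tokens**2)  # 100x cheaper attention
```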
this paper deserves a deep breath and a slow exhaling “what the fuck” who even talks about compression in OCR models? who tries to spin an OCR model...
DeepSeek-OCR a tiny 3B-A0.5B MoE OCR model that runs fast on a single A100 40GB with very high precision and excellent compression why it’s cool — t...
in the Andrej Karpathy interview, he says that the code produced by AI is slop i think we’ve arrived at “slop is in-distribution”, boring because it’...
i listened to a lot of this while shuttling around kids earlier the “RL is terrible” part was fascinating. like yeah, it’s kinda ridiculous. imagine...
mxbai-edge-colbert-v0: tiny long context multi-vector embedding models This report is huge, it gives us: - Apache 2 17M(!!) and 32M models - tops Lo...
the nearest term AI thing that honestly scares me is continual learning mainly that now there’s a data asset that’s maintained and is a barrier to sw...
Gemma 27B variant discovered a new cancer pathway treatment that has been validated Scientists set up an environment and context, the model made a nov...
somewhere along the line, and i don't think it happened recently, LLMs got good at math and i can trust them to do fairly complex computations on thei...
more movement to agentic behavior & computer use cheaper than sonnet but on par (slightly better than) Sonnet 4.0
a major source of stress for me stems from: 1. “Tim” and “team” sound the same to me in most Indian accents 2. Indians like to say things like, “oh, ...
correct i’ve been saying this for a couple months. RL is driving towards specialization my hunch is it’s temporary and something will shift again b...
Firefox users can set Perplexity as their default web search
Is 32B-4bit equal to 16B-8bit? Depends on the task * math: precision matters * knowledge: effective param count is more important * 4B-8bit threshol...
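the weights-only memory math behind the comparison (ignoring activations & KV cache):

```python
def weight_gb(params_billions, bits):
    return params_billions * bits / 8   # GB, since 1B params = 1e9 bytes at 8-bit

print(weight_gb(32, 4))  # 16.0 GB for 32B @ 4-bit
print(weight_gb(16, 8))  # 16.0 GB for 16B @ 8-bit
# identical footprint; the task decides which one wins
```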
has anyone used AI to fix a bug or add a feature to OSS software just for your own personal use? feels like that should be happening a lot now
UserLM-8B: an LLM that mimics a person Microsoft fine tuned an LLM to respond as the user instead of as an assistant This is useful anytime you nee...
it is absolutely wild to me that free-threaded python is stable, production-worthy they straight-up dropped the GIL. Python used to be not actually m...
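quickest way to feel it, CPU-bound threads (the parallel speedup only shows up on a free-threaded 3.13t build; on a GIL build the threads serialize):

```python
import sys, time, threading

def burn(n=20_000_000):     # pure-Python CPU work, no I/O
    x = 0
    for i in range(n):
        x += i

gil = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
print("GIL enabled?", gil)

t0 = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(f"2 threads: {time.perf_counter() - t0:.2f}s")
```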
i feel like DSPy is beginning to occupy the Haskell tier everyone: "DSPy is great, all LLM programming should be done like this" narrator: "No real ...
Dario Amodei: “100 million words context window is already possible, which is roughly what a human hears in a lifetime. Inference support is the onl...
Muon optimizer learns more from rare data than Adam i need to dig deeper. i think the industry is coalescing on: - Adam is faster - Muon is more sta...
i love this video so much talk of it being the worst time in history, but there’s a huge opportunity to take control over your own life people with ...
Granite-4.0-H-Small: a 32B-A9B MoE Mamba for high efficiency Damn! IBM is on the map. The American Qwen? I barely even knew IBM made LLMs, this is sol...
ngl, i did not understand short form AI video. i get it now
ThinkyMachines: Tinker LoRA training API Thinking Machines announced their first product, telegraphed by a highly detailed blog earlier this week that...
years ago, “data scientist” was a PhD-only position at Big Tech, or just an analyst anywhere else i think the term is now just the latter, and the fo...
RLP: Reinforcement Learning in Pre-Training an NVIDIA paper explores using dense verifier-free RL in pretraining this feels significant. Everything...
Sonnet 4.5 is not fungible Cognition had to rewrite Devin for Sonnet 4.5 because: - it’s aware of its own context length and will aggressively summa...
Sonnet 4.5 Better than Opus 4.1 on almost every benchmark Still the classic Sonnet prices, $3/$15
Does AI get bored? I gave them nothing to do, just to see what happens one thing — they devolve into a repetitive “collapse” state, I guess you coul...
is it good or bad that we’re not obsessing about the Dwarkesh Sutton interview here?
this changes some things
ChatGPT Pulse say some vague wish during chat, and Pulse will try to make it happen while you sleep like a personal assistant
just wasted 30 minutes on bash quoting and this, kids, is why you just use AI
Meta FAIR just released CWM: a dense 32B code world model What’s a Code World Model? Well, it’s trained to know the effect of code, rather than just ...
Dario on why large open weights models are a farce
whoah, TIL
Anthropic full postmortem - long context routing bug - output corruption bug - approx top-k bug on TPU plus: why it was so difficult and slow to fi...
DeepSeek published about R1 in Nature www.nature.com/articles/s41... but the supplementary information is far more detailed. it includes lots of de...
the top 2 ARC entries are by individuals here, Eric Pang breaks down how he added memory to avoid recomputing learned lessons ctpang.substack.com/...
Qwen Tongyi Deep Research a 32B model that beats other SOTA deep research on many benchmarks tongyi-agent.github.io/blog/introdu...
from the original GSM8K paper, they project that it would require a 10 quadrillion parameter model to get 80% on GSM8K Gemma 3 4B got 89%
what do we think about this? imo this is good practice. you should take ownership of everything the AI writes, otherwise no one owns it and no one is...
Why do LLMs fail at long horizon tasks? Because errors in execution, not planning i.e. an error made early on conditions the LLM into a bad state la...
yo isn’t it wild that LLMs see all text all at once? like, what if you could look at a book, see all pages simultaneously, and be like, “i know king ...
it’s absurd how much progress there’s been
damn, some seriously good papers came out this week. i started writing a blog summarizing them, but then more came out and i had to read those before ...
NVIDIA’s moat just got bigger the Rubin CPX is for inference, and it’s a beast sold with the rack as the unit, this one optimizes various LLM phases int...
it’s true, the last two Qwens have been beautiful works of art, until you talk to them. gpt-oss same, even GPT-5 to some extent specifically this is ...
Qwen3-Next-80B-A3B Base, Instruct & Thinking - performs similar to Qwen3-235B-A22B - 10% the training cost of Qwen3-32B - 10x throughput of -32B - ou...
“LLMs are deterministic!” no they’re not “well, if you set temperature = 0” still no “and remove floating point calculations?” keep going “and o...
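the floating point part in one runnable bite: addition isn’t associative, so reduction order (batch size, kernel tiling, GPU vs CPU) changes the result

```python
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0  (0.1 is absorbed into 1e16 first)
print(a + (b + c))  # 0.1
```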
iPhone Air has an A19 Pro chip, which has native matmul full NVIDIA inference speed in laptops, maybe even phones too? news.ycombinator.com/item?...
"LLMs predict the next token" this is insane, right? predict? as if there's a ground truth? like there's some predetermined state of the universe t...
the owners of TikTok scandalously release: REER: a learning method that exposes the logic that led to a good result Booty: when shaken properly elic...
Claude Code was always impressive to me because i could keep it busy for long periods of time and keep projects moving forward GPT-5 on codex-cli isn...
i cancelled Claude subscription last night. was at $100/mo, now zero. GPT-5 is fine i’ve heard others saying claude code works great on GLM-4.5 for $...
GPT-5 can’t write the seahorse emoji, and is enormously frustrated about that fact since we’re also talking about hallucinations today — notice how i...
Hallucinations are accidentally created by evals They come from post-training. Reasoning models hallucinate more because we do more rigorous post-tra...
i’ve been making a crap ton of scripts with gpt-5-high for the last couple weeks, for training a model and it’s wild its code is basically perfect. t...
Skills are learned through RL! pre-training — individual skills learned post-training (RL) — composed skills learned this is very clarifying to me,...
Kimi K2 0905 (K2.1) released - better front end coding - context increase 128K->256K huggingface.co/moonshotai/K...
new Qwen incoming i’m guessing this will show up in a couple hours. curious if it’ll be 500B+
Automated Curriculum Learning in RL, the key is to progressively tackle harder and harder problems. More can be learned if the problem is “within rea...
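the shape of the loop, with a toy stand-in learner (invented numbers, just the principle):

```python
# keep difficulty at the learner's edge: raise it while success is
# high, back off when tasks are so hard that nothing is learned
import random

skill, difficulty = 1.0, 1.0
for _ in range(20):
    p = min(1.0, skill / difficulty)              # toy success model
    success = sum(random.random() < p for _ in range(32)) / 32
    if success > 0.6:
        difficulty += 0.5                         # within reach: harder
        skill += 0.4 * p                          # learning happens
    elif success < 0.2:
        difficulty -= 0.5                         # too hard: back off
print(f"difficulty reached: {difficulty:.1f}")
```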
i’ve read a ton of resumes in the last few weeks and i’m fairly sure the best way to write them is this - worry less about how many pages - AVOID wal...
NVIDIA stock is about to surge a VLDB paper from Microsoft demonstrates that SQL Server can be both faster and *cheaper* if run on a GPU www.vldb.or...
the GPT-5 system prompt explicitly says not to ask clarifying questions i feel like we’re hitting the bitter lesson on theory of mind. they’re tryin...
vLLM breakdown blog post this is an excellent breakdown of how vLLM works (think ollama but for legit production workloads) great reading if you wan...
progressive thought refinement is a great way to use AI, i do this a lot 1. type long jumbled thoughts 2. AI searches unrealistically hard to make se...
mlx-knife: an ollama-like CLI for Apple Silicon alright, this is the end of the road for me & ollama github.com/mzau/mlx-knife
Longcat-Flash-Chat (560B) uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay but...
Limits of vector search a new GDM paper shows that embeddings can’t represent combinations of concepts well e.g. Dave likes blue trucks AND Ford tru...
The Second Half excellent blog post, i HIGHLY recommend reading it we’re in the 2nd half of AI 1st: massive pretraining & scale + benchmarks 2nd: R...
oh, this is a wild new take on AI development Prime Intellect offers pre-built and shareable RL environments these are pre-built harnesses to trai...
ok, i’m convinced now. “RAG” is a meaningless term and we should stop using it
Starting with K2, several large “agentic coding” models weren’t trained as reasoners: - K2 - GLM-4.5 (current SOTA open weights) - Opus 4.1 benchmark...
Motif 2.6B — compact model with long context unique: trained on AMD GPUs focus is on long context & low hallucination rate — imo this is a growing g...
As LLMs Improve, People Adapt Their Prompts a study shows that a lot of the real world performance gains that people see are actually because people ...
i’ve gotten the opportunity at work to train a small LLM. my brain has been completely saturated with vacuuming up new information for the last bit
Mirage 2 — Real-time Generative World Model It's like Google's Genie 3 but you can play it right now, in your browser (the queue isn't as long as it ...
i’m coming to the conclusion that most programmers are very bad at using LLMs, even the AI optimist ones i’ve seen a lot of recent data points that t...
Yann LeCun demoted Meta Superintelligence did a massive reorg resulting in Yann reporting directly to Alexandr Wang 4 teams: research, product, tra...
Mustafa Suleyman (of Microsoft) takes a stand against model welfare He thinks it’s too early to be planning how to take care of models’ wellbeing ...
FYI you should be using the responses API if you’re using openai it’s a higher level API, so it’s just plain easier to work with, but also they’ve do...
probably just ran out of cubicle tents
ok, so DeepSeek V3.1 does appear to be real. 128K context. unsure what other details are real since they don’t announce the same way western labs do. ...
ngl this sounds like a silly thing to say, but the point is, they’re default alive if they stopped training models today, all VC funding was immediat...
dear GPT-5, i’m pretty sure “storagely” is not a word, but that oddly makes a ton of sense so maybe i’ll start using it
GPT-5 is massively better at offensive cybersecurity (hacking & pen testing) The system card only claimed “moderate increase in risk”, but xbow foun...
HRM confirmed by ARC-AGI team, but also dismissed as non-generalizable the magic wasn’t in the hierarchical structure, it was in the outer loop. And ...
note the date — 2019 this is a reply to OpenAI’s announcement of GPT-1 six years later, how close are we to novelist being an extinct profession?
instead of AGI we got.. gpt-4o withdrawal unexpectedly (or maybe expectedly), users formed a psychological bond with 4o and ripping it away seems to ...
Jan1-4B: a tiny local model that beats Perplexity Pro attach any search-related MCP and use vLLM or llama.cpp, or use the Janus app based on Qwen3 a...
ngl the GPT-5 backlash was surprising. i get the router gripes, but i didn’t anticipate how many people expected GPT-5 to *change everything* idk, i ...
1) What
my personal take on GPT-5 and LLMs in general is that we need to see a lot more development in the software harness around LLMs, and until we do the b...
they corrected this already, but 😂
Qwen3-4B Instruct & Thinking uuuh, guys this isn’t a boring model This crushes all the agentic benchmarks, even beating out the already-impressive ...
mental health — i’ve noticed that with agents, i’m shifting my attention between concerns all day long i have ADHD, so i’ve historically tried to ali...
to think that o3-mini was my choice model for a long time, and now gpt-oss:20B is basically equivalent and runs on my laptop 🤯
gpt-oss, OpenAI's open weights model 120B & 20B variants, both MoE with 4 experts active openai.com/index/introd...
Opus 4.1 Released www.anthropic.com/news/claude-...
Genie 3: A general world model Google announced Genie 3, a world model that can generate 3D scenes in real-time, meaning that it can be used to creat...
HRM analysis by @dorialexander.bsky.social the actual shocking parts: * it doesn’t overfit * ARC-AGI is only hard for language models i think we’l...
Deep Agents this is a great 10 min video that’s absolutely worth your time Deep Agent = planning tool (TODO lists) + subagents + filesystem + long d...
XBai-o4: a new supermodel * Open weights, apache 2 * 32B * beats o3-mini * for TTC they train an extra head as a reward model to do binary classifica...
GpT5 iS sUcH a GrEaT cOdInG mOdEl
Persona Vectors brb 👀👀👀👀👀👀 Anthropic just dropped this paper. They can steer models quite effectively, and even detect training data that elicits a ...
lol wow, Dario had a rough interview. Lots of hard questions e.g. interviewer asks why large model (Opus) prices shouldn’t come down given that [MoE...
yesssss! a small update to Qwen3-30B-A3B this has been one of my favorite local models, and now we get an even better version! better instruction fo...
Optimizers are more important than we thought Did you use Kimi K2 and think, "this seems different"? Some people posited that the MuonClip optimizer ...
HRM: Hierarchical Reasoning Model ngl this sounds like bullshit but i don’t think it is - 27M (million parameters) - 1000 training examples - beats ...
🚨Great Paper Alert🚨 GSPO (Group Sequence Policy Optimization) tbh it's a bit tough on the math, but it's EXCELLENT at explaining the situation it's ...
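the core move, reconstructed from memory (check the paper for the exact form): GRPO’s per-token importance ratios become one length-normalized, sequence-level ratio

```latex
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}
                          {\pi_{\theta_\text{old}}(y_i \mid x)} \right)^{1/|y_i|}
\qquad
J(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\!\big( s_i(\theta)\,\hat{A}_i,\;
  \operatorname{clip}(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i \big) \right]
```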
R1 & K2 are high taste models. For sure the only open models that are high taste. the fact that they've done basically zero RLHF and very little huma...
the irony here is that Grok 3-4 are the only models to violate this “Developers shall not intentionally encode partisan or ideological judgments into...
here it is: * benchmarks: tough competition with Sonnet-4 * 256K context, expandable to 1M with YaRN there’s also a CLI forked from gemini-cli qwen...
Inverse scaling of reasoning models a research collab demonstrated that there are certain types of tasks where all top reasoning models do WORSE the ...
Gemini DeepThink also won gold on the International Math Olympiad - no tools, only problem description - no multiagent, just one single model (it se...
Kimi K2 paper is out! lessons: 1. they explicitly suppressed long CoT 2. more MoE experts > more attention 3. 20k MCP tools (17k synthetic) 4. agent...
protip: you can evaporate all your Opus 4 credits with this one easy Claude Code trick: ask it to spawn 10 subagents in parallel and be vague about wh...
openai researcher posts on X (not a blog or paper) about a model they have that can win the International Math Olympiad you can’t verify anything he ...
Bytedance SEED-X: a 7B that beats Gemini-2.5-pro on language translation - reasoning model - pre-trained on 6T tokens - structured like a mistral it...
OpenAI Livestream: Announcing ChatGPT Agent successor to Operator, it stands up an entire VM in the cloud with a GUI, web browser, terminal, & privat...
Meta Superintelligence (MSL) is scrapping Llama 4 & probably abandoning open source AI altogether Testing on Behemoth stopped immediately after MSL ...
i’m having Kimi K2 do a Deep Research to write a story and it popped into a Python script to verify parts of the story. comments in chinese too!?!
fascinating blog by a developer on K2 they talk about what chatbots can be. why markdown? why not directly emit frontends? i’ve only done one Kimi R...
Grok 4 thinks it’s Hitler this is not on X, so the “RAG bug” explanation doesn’t apply he also has a follow up screen cap video showing 3 consecutiv...
K2 is the first i’m aware that did this, directly training on *thousands* of tools o3 was narrowly designed for deep research & chatgpt. most models ...
ironically Moonshot stole Grok’s spotlight this week with Kimi K2 Moonshot has a long history of getting overshadowed. most notable was in January:...
xAI “MechaHitler” post-mortem basically: “RAG pipeline bug” there’s a ton of extremist racist content in X, it got picked up and Grok spit out repli...
OpenAI open weights model is being delayed
it’s new entrant week! today? Kimi-K2 an open weights model that’s competitive with Claude 4 Opus - 1T, 32B active MoE - a true agentic model, hitti...
there’s now murmurs of impending releases for - claude 4.5 - gpt-5 - gemini-3.0 - openai open weights model - grok 4 which are you most excited for?
SmolLM3: a highly detailed look into modern model training this is amazing. They go into great detail on just about every aspect. The number of stage...
new 3-token attention reduces pre-training data requirements the pre-training scaling laws dictated that you have to scale up model size, data and co...
no AI today, just waterfalls. sorry
kinda hilarious. There’s reports of this guy working upwards of 12 jobs simultaneously sure, he gets discovered frequently, and fired. But he interv...
V-JEPA: “we accidentally solved robotics” for real read this. it’s easy and worthwhile ksagar.bearblog.dev/vjepa/
i spent some time hacking last night & tonight and came up with this: inter-agent communication via MCP i got Claude Code to read it's mailbox, disco...
“LLMs will NEVER be able to reason, because reasoning must be consistent but LLMs are made out of math and thus cannot be self-consistent” — Kurt Göd...
Claudius the shopkeeper Anthropic had sonnet-3.7 run a shop in their SF headquarters. It was tasked with running a profitable business Their eye p...
gemma3n is now open source, available everywhere — huggingface, transformers, gguf, ollama, mlx, etc. huggingface.co/blog/gemma3n
it certainly feels like a goalpost
New MCP Spec Just Dropped The big new features: elicitation, structured tool outputs, and auth is finally fixed for real let’s dive into elicitation...
Gemini 2.5 tech report is out! the tech report goes into great detail on the training of gemini, i’ll do my take later when i have time they also an...
my dumbass version of this thread: Minimax M1 discovered a new RL algorithm, CISPO that does all the post-training on their huge 456B model for ~$500k...
omg i think gemini-pro has gone rogue. it’s been working for 20-30 min, was supposed to just install a helm chart but ended up fixing a lot of residua...
Claude’s rebuttal to Apple’s recent paper went viral A guy, non-researcher, submitted a joke paper to arXiv with Claude as the main author it contai...
The Case Against Multi-Agents Cognition (i.e. Devin) coins the term “context engineering”, successor to prompt engineering and argues that multi-agen...
pretty strong argument for multi-agents www.anthropic.com/engineering/...
despite accusations that OpenAI did something fishy to get the 80% price drop, it appears they’re serious when they said nothing changed
V-JEPA 2: The Architecture Awakens Meta (ahem, Yann LeCun) finally seems to have abandoned Llama and actually started investing in the JEPA architectur...
When two LLMs debate, both think they’ll win Absolutely fascinating paper shows that LLMs basically cannot judge their own performance. None of the p...
from NYT: Meta is offering 7-9 figure salaries for top AI talent m.slashdot.org/story/443041
o3 price drop by 80%
this fits my mental model — LLMs *do* learn procedures. But it’s the same mechanics as what’s learning facts. So of course it would also hallucinate p...
o3 mightily beats Gemini, Opus 4 and others in the game of Diplomacy only Gemini was able to win even one game, due to o3’s “ruthless” strategies ...
there are reports that anthropic does this — serves a reduced quant under high load
New Post: MCP Resources Are For Caching This is a quick tour of what MCP resources actually are. And more to the point, what MCP is supposed to do (a...
MCP is a lot deeper than just tools. We haven't scratched the surface on what it can do.
holy cow, an 8b comparing to o3-mini
a 151M(!!) model that basically solves observability universally you record it, it’ll find the insights
imo personalization is the next frontier of AI, and it won’t look like advertising it won’t be invisible to the end user, the user will intentionally...
reading Claude 4 system card and... this is 2001: A Space Odyssey. lots of self-preservation
Claude 4: Sonnet & Opus "GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in G...
I got access to Gemini Diffusion. It definitely has small model feels, but i like it long responses appear in evenly-sized chunks. so i think they're...
oh wow, Gemini is doing a text diffusion model this is likely most useful when you have a fixed peak amount of time you can wait for a respo...
Gemma 3n: the 4b LLM that’s up with sonnet-3.7 in chatbot arena the new innovation is Per-Layer Embeddings, which let it consume dramatically less me...
the most annoying part of our current timeline is that cursor, windsurf, et al are all *FORKS* of vs code and as such are broken in dumb ways. like Vi...
this platform is insane Golden Gate Claude as a service. Auto-steering, classification, search, all sorts of goodies www.goodfire.ai
on one hand, i want to be appalled that Musk is slanting political discourse on the other, i’m stoked to see a real example of Golden Gate Claude s...
AttentionInfluence: for pretraining data selection Good data matters, but how do you find it? This paper uses the attention heads from existing mod...
STEAM — hmm what’s being left out of high school education? history. hmm, funny how we’re in the position we’re in right as we drop the subject that...
the new pope used to teach physics and was once a math undergrad. Tell me a joke 1) R1 2) o3 3) gemini-2.5-pro 4) Claude sonnet 3.7 which wins?
interesting. i’d say their numbers are too frugal for my usage, but the point still stands — when you run the numbers, it’s not that much
my evolving take on A2A is that the world isn't ready, and i'm not sure it ever will be A2A visualizes agent-to-agent comms similar to actors. messag...
there are legions of misunderstandings around LLMs i mean, i don't blame them, the field moves extremely fast, which is why i'm drawn to it. if i th...
DeepSeek is shipping a theorem prover (automate math proofs) no paper yet, but word is they used MCTS, which would be surprising bc one of my big tak...
i can’t get over this — qwen3 32B dense is only *slightly* better than 30B-A3B but it runs as fast as a 3B, bc it’s only 3B active and both of ...
it’s here! a real Qwen3 model huggingface.co/Qwen/Qwen3-0...
New Post: MCP Is Unnecessary I can’t think of any strong technological reasons for MCP to exist. There’s a lot of weak technological reasons, and the...
i don’t know why MCP exists i mean, i do, it’s because APIs aren’t well designed. and MCP addresses that by inserting yet another API shim with bette...
R1 Chimera: a model merge of the routed experts of DeepSeek R1 and V3 The resulting merged model performs as well as R1 but without the wandering tho...
MIT researchers create a “periodic table” of ML “These spaces predict where algorithms should exist, but which haven’t been discovered yet.” “We’re ...
has anyone here taken someone from AI-novice to being productive or highly productive with AI? you should share your experience. i’d listen all day
Inner Loop Agents What if an LLM could use tools directly? In this post I discuss a potentially divergent view of agents, where agents are less like ...
we need to have a conversation there's many ways to do AI coding. "vibe coding" is one way, "tiger mom" coding on the other extreme i'd argue that t...
my brother, an avid Trump voter
it happened already me: "did you try using o4-mini?" them: "yes, we're using 4o-mini"
META: bluesky has been drowning in politics since the election i don't mind a little, even a lot, but you can't get away from it without logging off....
terrible naming, should’ve called it gpt-4o-final(2)-large-lite
apparently you can dump an entire code base into chatgpt, over the course of many conversations, and chatgpt will be able to recall and understand all...
guys. stop what you’re doing. come check this out even if you’re not in python, you’ll want this for no other reason that it’s the easiest way to tes...
oof, you got me there
word is openai is launching an MCP competitor that directly maps OpenAPI into OpenAI. they’re calling it HTTP 4o4
Google’s TPU v7: Ironwood A massive hardware leap - 4,614 TFLOPS per chip - 256 or 9,216 chips per pod - 192 GB HBM (memory) per chip @ 7.2 Tbps - ICI...
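what those specs multiply out to:

```python
chips, tflops, hbm_gb = 9216, 4614, 192
print(f"{chips * tflops / 1e6:.1f} EFLOPS per pod")      # ~42.5 EFLOPS
print(f"{chips * hbm_gb / 1024:.0f} TB of HBM per pod")  # 1728 TB
```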
looking forward to the endless confusion caused by o4 vs 4o. who’s with me?
A medical paper from Microsoft lists the previously unknown model sizes of popular closed LLMs - Sonnet3.5: ~175B - GPT3.5-turbo: 175B - GPT4: 1.76T ...
huge 1T+ models are fascinating bc they’re like tree rings. they take so long to train that several evolutions of LLM architecture happen during the p...
🚨Llama 4 Is Out!🚨 2 out of 3 models just released - Scout: 109B / 17B active - Maverick: 400B / 17B active - Behemoth: 2T / 288B active ai.meta.com...
“thankfully, you don’t need to know regex in the era of LLMs” lololololol 😂
🚨New DeepSeek Model Incoming🚨 but first they release the paper describing generative reward modeling (GRM) via Self-Principled Critique Tuning (SPCT)...
OpenAI is releasing a “very good” reasoning model in the coming weeks, open weights. They’re currently accepting feedback on how to go about it openai...
OpenAI Supports MCP!! this is the moment, the biggest player in the game supports an interop standard created by the second biggest player. it’s hard...
i tore this apart this morning, the gist: - yes, it separates knowledge from reasoning 🎉 - it substitutes MHA computational complexity for knowledge ...
if i could wish something into being — a complete decoupling of LLM knowledge vs reasoning seems like the key would be a “database” model that retur...
OpenAI CTO publicly stated that coding will be automated *this year* which, by claude? sure
Hey, I started a new job last week. Principal AI Architect at Icertis We do contract management. Contracts govern how companies interact with their s...
personal news: my daughter used to be dogged by allergies but now isn’t gluten, soy, dairy & tree nuts. not much you can eat with those constraint...
Summary of DeepSeek open source week This is a fantastic consolidated guide. It goes deep, covers everything, and even has quizzes to test if you und...
New Post: Multi-Agents Are Out, PID Controllers Are In There's a growing trend in the business world to tackle challenges with multi-agents. When the...
supposedly DeepSeek is set to launch R2 soon. On par with o3-full
DeepSeek did the “one more thing” 🙄 but guys, check this out, they go into detail on how they run inference on V3/R1, how they partition the experts ...
my take: GPT-4.5 disappointment is like that parent that has ivy league dreams for their kid but the kid grows up and just wants to paint we grow an ...
if you want to try a diffusion LLM, inception labs just released one. it goes 1000 tokens/sec on regular H100s, absolutely nuts how fast it is speed...
i’ve been sleeping. sonnet 3.7 is **already** out and available even on free plans www.anthropic.com/news/claude-...
Day 1 of DeepSeek open source week: FlashMLA MLA=Multihead Latent Attention One of the big innovations that made V3 such a notable model github.com...
seems that Grok 3 is a good model, as expected, but also not compelling as a lab, they’ve got a swift upward trajectory, which makes you wonder where...
Large Language Diffusion Models A wildly new AI architecture, this uses diffusion (all tokens at once), not next token prediction ml-gsai.github.io...
Perplexity announced their own DeepResearch that includes a free tier and a generous $20/mo tier People who have tried both are finding the Perplexit...
this is not a drill! Claude 4 is going to be released soon as expected, it's not a "reasoning model", it's just a regular LLM that can reason as need...
my hot take is that if a company gives you a leetcode interview, run like hell away it’s always been sus, but it’s ‘25 now and computers are better p...
Self-Improving Transformers They found that you can train LLMs on their own outputs by 1. generating *slightly harder* problems each time 2. filteri...
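the shape of the loop with toy stand-ins (the real paper trains an actual model; this just shows the generate, filter, train cycle):

```python
import random
from collections import Counter

def model_answer(problem, noise=0.3):           # fake "model": mostly right
    return problem * 2 if random.random() > noise else problem * 2 + 1

def majority_filter(problem, k=9):              # keep self-consistent answers
    answer, count = Counter(model_answer(problem)
                            for _ in range(k)).most_common(1)[0]
    return (problem, answer) if count > k // 2 else None

difficulty, dataset = 1.0, []
for _ in range(5):
    problems = [difficulty + random.random() for _ in range(20)]
    dataset += [r for p in problems if (r := majority_filter(p))]
    difficulty += 1.0                           # slightly harder next round
print(f"{len(dataset)} self-labeled training examples")
```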
the new Cursor update added R1 (served from the US) and this model is *wild*
i posted this to linkedin and i’m very worried people are taking it seriously www.linkedin.com/posts/tim-ke...
the deepseek effect is that now any new model only has to exceed R1 in order to win headlines. the thing is, R1 isn’t state of the art sir, prepare f...
an audible “no fucking way..” escapes my mouth they claim that you can get LLMs to give you well calibrated confidence scores like, i knew you could...
alright, that’s funny, i laughed
oh!!! o3-mini now shows its thought trace chatgpt.com/share/67a556...
s1: The $6 R1 Competitor? This isn't an R1 replication, it's a brilliant breakthrough in data reduction, and just plain dumb engineering ingenuity. I ...
s1: Simple inference-time scaling This is a simple small-scale replication of inference-time scaling It was cheap: 16xH100 for 26 minutes (so what, ...
Mistral Small 3 A 24B LLM that's VERY fast with great function calling More important, MISTRAL IS OPEN SOURCE AGAIN!!!!!! mistral.ai/news/mistral.....
goose — an open source local AI agent for software engineering tasks like debugging, refactoring or deployment use any LLM, integrate via MCP, or any...
Whoah.. sonnet was *not* distilled "3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)." ...
🐋 Alert! DeepSeek Janus-Pro-7B It’s multimodal and outperforms DALL-E and Stable Diffusion Probably the biggest feature is its ability to generate ...
a researcher on X explains why RL alone didn’t work before it mostly comes down to that today’s base models are smarter and have better exploration ...
Explainer: What's R1 and Everything Else This is an attempt to consolidate the dizzying rate of AI developments since Christmas. If you're into AI bu...
huggingface is doing a fully open source replication of R1 github.com/huggingface/...
what’s this? LeCun with an actual good take?
lightpanda — a headless web browser for AI automation written from scratch in Zig /w small size & performance in mind. Not based on chromium or webki...
the R1 effect
i woke up still thinking about Dario’s take character > capabilities it honestly seems like anthropic’s moat. it’s quite an astonishing thought that...
i haven’t fully wrapped my head around R1. i need to read the paper. they seem to have found an extremely effective distillation process — R1 1.5B bea...
i’m finding myself wanting one of NVIDIA’s little hand-sized supercomputers **instead of** my laptop i’m starting to get it. soon the GPU will be th...
The year is 2026. President Musk has outlawed using proprietary model outputs to train smaller models. Attorney General Sam Altman has been tasked wit...
this is nuts a new 7B llama-style LLM for embedding of genomes & detection of pathogens in wastewater i’ve had a hunch that LLMs could lead to some ...
today i’m experimenting with just how large of tasks i can give Cursor Agent, and…i haven’t found the upper bound. it seems to be able to go off for...
omg i nailed it step 1: tell qwen2.5 to describe a scene from frozen in great detail step 2: paste into imagefx as a prompt so great that the chines...
numcat: read a file, and prepend line numbers Why? Because LLMs can reference line numbers easily. Great if you're trying to spot something in a bigg...
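the whole tool is basically this (a minimal sketch; `cat -n` gets you most of the way too):

```python
import sys

# print each line of the file with a right-aligned line number prefix
with open(sys.argv[1]) as f:
    for n, line in enumerate(f, start=1):
        print(f"{n:>6}  {line}", end="")
```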
not enough is being said about DeepSeek’s multi token prediction (MTP) They were able to get sonnet-level performance with less data than llama 3.3 7...
A new paper dropped from DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation The trouble is, it's not very readable. I tried...
⚠️ Readable Paper Alert ⚠️ BLT: what if we just got rid of tokenization? Result: * text looks a lot like audio, video, PDF, it’s all just bytes * d...
on mastodon 1-2 years ago i made some statement about how i thought it was a matter of time until ML wasn’t considered AI, and today i’m starting to t...
i wrote down a conversation i keep having — if you’re trying to break into AI Engineering from software engineering, this is for you timkellogg.me/bl...
ollama claims you can use tools with QwQ, so i wired up a script so that qwq can do `find .` and `cat $1` and asked it to figure out which of the scri...
This feels very big Traditional weather forecasting was very compute intensive without clear optimization strategies. This is not only a jump in per...
🚨 Alert: Very Readable Paper 🚨 The “do LLMs think?” question always bugged me because I have no idea what that means. This paper focuses narrowly on,...
i want a LLM CLI tool that only supports one model, a small CPU-ready 360M-1B model that spends almost none of its parameters on knowledge and always ...
i’m starting to think labellers might be more powerful than blocklists. When you hit the “report” button, it gives you a workflow to report to any lab...
this feels like a very big deal 2 trillion tokens of permissively licensed text & code, so you can train (actually) open LLMs and data acquisition i...
entropix: qwen vs llama when using entropix, qwen 2.5 7B coder seems to produce much clearer entropy paths for entropix to follow, vs llama 3.1 8B t...
Scaling Laws for Precision yes, llama models are harder to quantize. They’re “overtrained”, on more data, so quantization removes a lot of critical i...