Bluesky Thread

gpt-oss-safeguard 20b & 120b

October 30, 2025 View original thread

gpt-oss-safeguard 20b & 120b

a pair of open weights models that let you enforce custom content moderation policies through prompts

they’re reasoning models, so you get a CoT explanation, not just a classification

openai.com/index/introd...

openai.com

Introducing gpt-oss-safeguard

New open safety reasoning models (120b and 20b) that support custom safety policies.

38 2

this feels like the first content moderation model that i’d actually consider using

1. it only looks for what i want it to look for
2. explanations

the normal approach is non-prompted, trained for a very static set of classes

also, this is the first LLM that i’m aware of that separates control & data channels — it takes the prompt & the conversation as two separate inputs

this has always been the right way to approach prompt injection, in my mind

More like this