Bluesky Thread

gpt-oss-safeguard 20b & 120b

View original thread
gpt-oss-safeguard 20b & 120b

a pair of open weights models that let you enforce custom content moderation policies through prompts

they’re reasoning models, so you get a CoT explanation, not just a classification

openai.com/index/introd...
openai.com
Introducing gpt-oss-safeguard
New open safety reasoning models (120b and 20b) that support custom safety policies.
38 2
this feels like the first content moderation model that i’d actually consider using

1. it only looks for what i want it to look for
2. explanations

the normal approach is non-prompted, trained for a very static set of classes
9
also, this is the first LLM that i’m aware of that separates control & data channels — it takes the prompt & the conversation as two separate inputs

this has always been the right way to approach prompt injection, in my mind
6
38 likes 2 reposts

More like this

×