gpt-oss-safeguard 20b & 120b
a pair of open weights models that let you enforce custom content moderation policies through prompts
they’re reasoning models, so you get a CoT explanation, not just a classification
openai.com/index/introd...
gpt-oss-safeguard 20b & 120b
View original threadthis feels like the first content moderation model that i’d actually consider using
1. it only looks for what i want it to look for
2. explanations
the normal approach is non-prompted, trained for a very static set of classes
1. it only looks for what i want it to look for
2. explanations
the normal approach is non-prompted, trained for a very static set of classes
9
also, this is the first LLM that i’m aware of that separates control & data channels — it takes the prompt & the conversation as two separate inputs
this has always been the right way to approach prompt injection, in my mind
this has always been the right way to approach prompt injection, in my mind
6