Does this quirk reproduce on open-weights models, e.g. GPT-OSS? Are there similar reasoning-trace quirks in other model families?
Sounds like a fun target for some mech-interp work. It might be a meaningful behavior, or it might be meaningless noise; there's plenty of room to try different things to figure that out.
But, of course, OpenAI wouldn’t let anyone have fun with their proprietary models, so we’d need to replicate this in an open model to start.
Unfortunately, while o3 and GPT-5 seem to exhibit this vocabulary, we note in the relevant paper that we did not observe it in OSS-120B, which is a shame: there could have been a lot of interesting work that people outside of OpenAI would be able to do. It's still very unclear to me why this would be; my best uninformed guess is that it was somehow filtered out at some point after distillation (again, assuming OSS-120B was distilled).