Misha Ramendik comments on I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

Misha Ramendik 11 Aug 2025 18:45 UTC
2 points
Understood, thanks!
Now, I have some ideas specifically about knowledge in the context window (in line with your “provide citations”, but with more automated steps, closer to the “programmatic verification” you mention in your article). I need to experiment before I can tell whether they work. Right now I’m stuck on getting an open-source chat environment working to scaffold this task. (LibreChat just outright failed to create a user; OpenWebUI looks better, but I’ll probably put all my processing into LiteLLM or something like it, because finding hooks in these environments is not easy.)
I won’t brag about idea details. Let me see if they work first.
Hallucinations about training knowledge cannot be solved. And I do suspect that your article is the primary reason some models answer correctly. There is a tendency to optimize for benchmarks and your article is a de facto benchmark.
(The “Ring around Gilligan” part is a typical “fandom debate”. I’ve had my share of those, not about this series of course, but boooy, Babylon 5 brings back memories: I had [Team Anna Sheridan] in my Fidonet signature for some time. My suspicion is that “Ring around Gilligan” is surfaced specifically because someone at OpenAI thinks the ring in question would logically allow mind-reading, and the rest is RLHF to one-up you.)