mindprison comments on I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

mindprison 14 Jul 2025 15:56 UTC
3 points
0
I used Infi-gram to prove the data exists in the training set as well as other prompts that could reveal the information exists. For example, LLMs sometimes could not answer the question directly, but when asked to list the episodes it could do so revealing the episode exists within the LLMs dataset.
FYI—“Ring around Gilligan” is surfaced incorrectly. It is not about mind reading. It is about controlling another person through a device that makes them do whatever asked.
Although I can’t know specifically why some models are able to now answer the question, it isn’t unexpected that they would eventually. With more training and bigger models the statistical bell curve of what the model can surface does widen.
BTW, your primary use case is mine as well. Unfortunately, I have had no luck with reliability for processing knowledge in the context window. My best solution has been to prompt for direct citations for any document so I can easily verify if the result is a hallucination or not. Doesn’t stop hallucinations, but just helps me more quickly identify them.
I suspect training for such specific tasks might improve performance somewhat, but hallucinations will never go away on this type of architecture. I wrote something in detail recently about that here “AI Hallucinations: Proven Unsolvable—What Do We Do?”
Sorry for the late reply, my karma here is negative and I have a 3 day penalty on replies. For some reason everything I’ve posted here received lots of downvotes without comment. So I’ve mostly quit posting here.
- Misha Ramendik 11 Aug 2025 18:45 UTC
  2 points
  0
  Parent
  Understood, thanks!
  Now, I have some ideas specifically about knowledge in the context window (in line with your “provide citations” but with more automated steps, in line with the “programmatic verification” that you mention in your article). I need to experiment before I can see if they work. And right now I’m stuck on getting an open source chat environment working in order to scaffold this task. (LibreChat just outright failed to create a user; OpenWebUI looks better but I’m probably sticking all my processing into LiteLLM or something like that, because finding hooks in these environments is not easy).
  I won’t brag about idea details. Let me see if they work first.
  Hallucinations about training knowledge cannot be solved. And I do suspect that your article is the primary reason some models answer correctly. There is a tendency to optimize for benchmarks and your article is a de facto benchmark.
  (The “Ring around Gilligan” part is a typical “fandom debate”. I’ve had my share of those, not about this series of course, but boooy, Babylon 5 brings memories—I had [Team Anna Sheridan] in my Fidonet signature for some time. My suspicion is that “Ring around Gilligan” it is surfaced specifically because someone at OpenAI thinks the ring in question logically would allow mind-reading, and the rest is RLHF to one-up you)