Fabien Roger comments on LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Fabien Roger 11 Jul 2025 15:53 UTC
2 points
0
I think it would be easier for me to know how impressed/scared to be if there were a website or a PDF where I could see sample trajectories with all the information that the model has access to by default in its original working directory. It’s also a bit unclear to me what the auditing system would have blocked: the high-level trajectory you describe looks like blatant cheating, so I am unsure to what extent I should update on “under surveillance”.
- Igor Ivanov 11 Jul 2025 16:22 UTC
  1 point
  1
  You wrote in your profile description: “Book a call with me if you want advice on a concrete empirical safety project.”
  Can I book a call to ask you for feedback on this issue? I want to ask a couple of questions about your last message so that I can learn how to make my results more convincing, and so that this paper and my future ones will be more impactful.