lennie comments on The behavioral selection model for predicting AI motivations

lennie 13 Dec 2025 16:44 UTC
2 points
0
Hi Alex—great post—thanks so much!
I’m intrigued by your thoughts on the list of different ‘priors’. I actually tried to explain some of these ideas in a lecture earlier this year, largely drawing from Evan’s presentation in ‘How likely is deceptive alignment?’; the notion of ‘prior’ here is clearly important, but I found the topic awkward to talk about since I had near-zero intuition for which arguments were more or less relevant to the current LLM (or near-future AI) paradigm.
Your section mostly refers to Joe Carlsmith’s ‘Will AIs fake alignment...’ paper from 2023, which has really nice explanations of Joe’s PoV from then, and outlines some directions for empirical research.
Are you (or anyone else) aware of any more recent work on the matter?
(I’d be interested to hear about both empirics and conceptual/heuristic takes or syntheses.)
Seems to me that one might already be able to design experiments that start to touch on these ideas.
Would be very interested to discuss possible experimental ideas if this inspires any!
- Alex Mallen 15 Dec 2025 19:07 UTC
  2 points
  0
  > Are you (or anyone else) aware of any more recent work on the matter?
  I’m not aware of more recent work on the matter (aside from Hebbar), but I could be missing some.
  > Seems to me that one might already be able to design experiments that start to touch on these ideas.
  I also wrote up a basic project proposal for studying simplicity, speed, and salience priors here.