Hi Alex—great post—thanks so much!
I’m intrigued by your thoughts on the list of different ‘priors’. I actually tried to explain some of these ideas in a lecture earlier this year, largely drawing from Evan’s presentation in ‘How likely is deceptive alignment?’; the notion of ‘prior’ here is clearly important, but I found the topic awkward to talk about since I had near-zero intuition for which arguments were more/less relevant to the current LLM (or near-future AI) paradigm.
Your section mostly refers to Joe Carlsmith’s ‘Will AIs fake alignment...’ paper from 2023, which has really nice explanations of Joe’s PoV from then, and outlines some directions for empirical research.
Are you (or anyone else) aware of any more recent work on the matter?
(I’d be interested to know both about empirics, and conceptual/heuristic takes/syntheses).
Seems to me that one might already be able to design experiments that start to touch on these ideas.
Would be very interested to discuss possible experimental ideas if this inspires any!
I’m not aware of more recent work on the matter (aside from Hebbar), but I could be missing some.
I also wrote up a basic project proposal for studying simplicity, speed, and salience priors here.