Hi Steve—thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like “studying/intervening on an existing AI system” rather than focusing on how it’s trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here—and I talk a lot about studying training as part of behavioral science, e.g.:
Let’s call an AI’s full range of behavior across all safe and accessible-for-testing inputs its “accessible behavioral profile.” Granted the ability to investigate behavioral profiles of this kind in-depth, it also becomes possible to investigate in-depth the effect that different sorts of interventions have on the profile in question. Example effects like this include: how the AI’s behavioral profile changes over the course of training; how the behavioral profile varies across different forms of training; how it responds to other kinds of interventions on the AI’s internals (though: this starts to border on “transparency tools”); how it varies based on the architecture of the AI; etc. Here I sometimes imagine a button that displays some summary of an AI’s accessible behavioral profile when pressed. In principle, you could be pressing that button constantly, whenever you do anything to an AI, and seeing what you can learn.
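To make that “button” image a bit more concrete, here is a minimal sketch (in Python, with entirely hypothetical names like respond, probe_inputs, and classify—none of these are real APIs) of what pressing it before and after some training step or intervention could look like:

```python
# A minimal sketch of the "behavioral profile button" idea, assuming a
# hypothetical model with a `respond(prompt) -> str` method and a fixed
# suite of safe, accessible-for-testing probe inputs. All names here are
# illustrative, not part of any real library.
from collections import Counter
from typing import Callable, Iterable


def behavioral_profile(
    respond: Callable[[str], str],
    probe_inputs: Iterable[str],
    classify: Callable[[str, str], str],
) -> Counter:
    """Press the 'button': run the model over a suite of probe inputs and
    summarize its behavior as counts per behavioral category."""
    summary: Counter = Counter()
    for prompt in probe_inputs:
        output = respond(prompt)
        # `classify` stands in for whatever behavioral labeling you use
        # (e.g. "refuses", "complies", "deceptive-seeming", ...).
        summary[classify(prompt, output)] += 1
    return summary


# In principle you could press the button constantly—e.g. after every
# training step or internal intervention—and compare the summaries:
#
#   before = behavioral_profile(model.respond, probes, classify)
#   ...apply some training step or intervention to the model...
#   after = behavioral_profile(model.respond, probes, classify)
#   print(after - before)
```

This is only a cartoon, of course—the interesting questions are about what the probe suite and the summary look like—but it hopefully makes clear that the “button” is meant to be pressed throughout training/construction, not just on a finished system.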
And techniques for training/constructing AIs that benefit from understanding/direct design of their internals would count as “transparency tools” for me.