It seems like the notion of “psychology” that you’re invoking here isn’t really about “how the AI will make decisions.” On my read, you’re defining “psychology” as “the prior over policies.” This bakes in things like “hard constraints that a policy never takes an unsafe action (according to a perfect oracle)” by placing 0 probability on such policies in the prior. This notion of “psychology” isn’t directly about internal computations or decision making. (Though, of course, some priors—e.g. the circuit depth prior on transformers—are most easily described in terms of internal computations.)
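To make this concrete, here is a toy sketch (my own illustration, not something from the original discussion) of what "psychology as a prior over policies" looks like: the hard constraint lives entirely in the prior, with no reference to how any policy computes its actions. The policy set, the base prior, and the `perfect_oracle` function are all hypothetical stand-ins.

```python
# Hypothetical illustration: a "psychology" represented purely as a prior
# over policies. A hard safety constraint is baked in by assigning zero
# probability to any policy that an (assumed perfect) oracle flags as ever
# taking an unsafe action. Nothing here models internal computations.

policies = {
    "always_safe": ["safe", "safe"],
    "mostly_safe": ["safe", "unsafe"],   # takes one unsafe action
    "reckless":    ["unsafe", "unsafe"],
}

# A base prior over the three toy policies (made-up numbers).
base_prior = {"always_safe": 0.5, "mostly_safe": 0.3, "reckless": 0.2}

def perfect_oracle(actions):
    """Stand-in for the perfect oracle: a policy passes only if it
    never takes an unsafe action."""
    return all(a == "safe" for a in actions)

def constrained_prior(prior, policies, oracle):
    """Zero out oracle-flagged policies, then renormalize the rest."""
    masses = {name: (p if oracle(policies[name]) else 0.0)
              for name, p in prior.items()}
    total = sum(masses.values())
    return {name: m / total for name, m in masses.items()}

prior = constrained_prior(base_prior, policies, perfect_oracle)
print(prior)  # only "always_safe" retains probability mass
```

The point of the sketch is that the constraint is a fact about which policies get probability mass, not about the decision-making machinery inside any policy; a circuit-depth prior, by contrast, would be easiest to write down by referring to internals.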