It seems like the notion of “psychology” that you’re invoking here isn’t really about “how the AI will make decisions.” On my read, you’re defining “psychology” as “the prior over policies.” This bakes in things like “hard constraints that a policy never takes an unsafe action (according to a perfect oracle)” by placing 0 probability on such policies in the prior. This notion of “psychology” isn’t directly about internal computations or decision making. (Though, of course, some priors—e.g. the circuit depth prior on transformers—are most easily described in terms of internal computations.)
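To make this concrete, here is a toy sketch (my own illustration, not something from the original discussion) of what "psychology as a prior over policies" looks like: the hard constraint lives entirely in the prior, with no reference to how any policy computes its actions. The policy set, the base prior, and the `perfect_oracle` function are all hypothetical stand-ins.

```python
# Hypothetical illustration: a "psychology" represented purely as a prior
# over policies. A hard safety constraint is baked in by assigning zero
# probability to any policy that an (assumed perfect) oracle flags as ever
# taking an unsafe action. Nothing here models internal computations.

policies = {
    "always_safe": ["safe", "safe"],
    "mostly_safe": ["safe", "unsafe"],   # takes one unsafe action
    "reckless":    ["unsafe", "unsafe"],
}

# A base prior over the three toy policies (made-up numbers).
base_prior = {"always_safe": 0.5, "mostly_safe": 0.3, "reckless": 0.2}

def perfect_oracle(actions):
    """Stand-in for the perfect oracle: a policy passes only if it
    never takes an unsafe action."""
    return all(a == "safe" for a in actions)

def constrained_prior(prior, policies, oracle):
    """Zero out oracle-flagged policies, then renormalize the rest."""
    masses = {name: (p if oracle(policies[name]) else 0.0)
              for name, p in prior.items()}
    total = sum(masses.values())
    return {name: m / total for name, m in masses.items()}

prior = constrained_prior(base_prior, policies, perfect_oracle)
print(prior)  # only "always_safe" retains probability mass
```

The point of the sketch is that the constraint is a fact about which policies get probability mass, not about the decision-making machinery inside any policy; a circuit-depth prior, by contrast, would be easiest to write down by referring to internals.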