Jessica Taylor. CS undergrad and Master’s at Stanford; former research fellow at MIRI.
I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.
Blog: unstableontology.com
Twitter: https://twitter.com/jessi_cata
I’ve written criticisms of orthogonality: The Obliqueness Thesis, Measuring intelligence and reverse-engineering goals.
While I do think human moral reasoning suggests non-orthogonality, it’s a somewhat conceptually tricky case. So recently I’ve been thinking about more straightforward ways of showing non-orthogonality relative to an architecture.
For example, consider RL agents playing Minecraft. If you want to get agents that beat the game, you could encode this preference function directly as a reward function, reward it when it beats the game. However this fails in practice.
The alternative is reward shaping. Reward it for pursuing instrumental values like exploring or getting new resources. This agent is much more likely to win, despite it being mis-aligned.
What this shows is that reinforcement learning is a non-orthogonal architecture. Some goals (reward functions) lead to more satisfaction of convergent instrumental goals than others.
Slightly tricker case is humans. Direct encoding of inclusive fitness as human neural values seems like it would produce high fitness, but we don’t see humans have this, therefore the space evolution is searching over is probably non-orthogonal.
Maybe it’s like the RL case where organisms are more likely to have fitness if they have neural encodings of instrumental goals, which are easier to optimize short-term. Fixed-action patterns suggest something like this, there’s a “terminal value” of engaging in fixed action patterns (which happen to be ones that promote fitness; evolution searched over many possible fixed action patterns).
So instead of assuming “organisms get more fitness by having values aligned with inclusive fitness” we could re-frame as, “inclusive fitness is a meta-value over organisms (including their values), some values lead to higher inclusive fitness than others, empirically”.
This approach could be used to study human morality. Maybe some tendencies to engage in moral reasoning lead to more fitness, even if moral reasoning isn’t straightforwardly aligned with fitness. Perhaps because, morality is a convenient proxy that works in bounded rationality.
A thesis would be something like, orthogonality holds for almost no architectures. Relative to an architecture like RL or neural encodings of values, there are almost always “especially smart values” that lead to more convergent instrumental goal achievement. Evolution will tend to find these empirically.
This doesn’t contradict that there is some architecture that is orthogonal, which I take to be the steelman of the orthogonality thesis. However it suggests that even if this steelman is true, it has limited applicability to empirically realized agent architectures, and in particular doesn’t apply to human preference/morality.