This DeepMind paper explores some intrinsic limitations of agentic LLMs. The basic idea is (my words):
If the training data used by an LLM is generated by some underlying process (or context-dependent mixture of processes) that has access to hidden variables, then an LLM used to choose actions can easily go out-of-distribution.
For example, suppose our training data is a list of a person's historical meal choices over time, formatted as tuples that look like (Meal Choice, Meal Satisfaction). The training data might look like (Pizza, Yes)(Cheeseburger, Yes)(Tacos, Yes).
When the person originally chose what to eat, they might have had some internal idea of what food they wanted to eat that day, so the list of tuples will only include examples where the meal was satisfying.
If we try to use the LLM to predict what food a person ought to eat, that LLM won't have access to the person's hidden daily food preference. So it might make a bad prediction, and you could end up with a tuple like (Tacos, No). This immediately puts the rest of the sequence out-of-distribution.
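To make the mechanism concrete, here's a toy simulation (my own sketch, not from the paper). The hidden variable is the person's daily craving: the logged data is filtered by it, so satisfaction is always "Yes" in training, while a model choosing meals without access to the craving only matches it by chance.

```python
import random

random.seed(0)
MEALS = ["Pizza", "Cheeseburger", "Tacos"]

def logged_episode():
    # The human knows their hidden daily craving and picks a matching meal,
    # so the historical log only ever contains (meal, "Yes") tuples.
    craving = random.choice(MEALS)
    return (craving, "Yes")

def model_chosen_episode():
    # An imitator trained on that log has no access to the hidden craving.
    craving = random.choice(MEALS)  # hidden variable, unseen by the model
    choice = random.choice(MEALS)   # model's pick, uncorrelated with craving
    return (choice, "Yes" if choice == craving else "No")

log = [logged_episode() for _ in range(1000)]
rollout = [model_chosen_episode() for _ in range(1000)]

log_rate = sum(s == "Yes" for _, s in log) / len(log)
rollout_rate = sum(s == "Yes" for _, s in rollout) / len(rollout)
print(log_rate)      # 1.0 -- every training tuple is satisfying
print(rollout_rate)  # about 1/3 -- (meal, "No") tuples the model never saw
```

The model's rollouts contain (meal, "No") tuples that never appeared in training, so everything conditioned on them is out-of-distribution for the model.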
The paper proposes various solutions for this problem. I think that increasing scale probably helps dodge this issue, but the setup does highlight an important weak point of using LLMs to choose actions that causally affect the world.
I think that humans are sorta “unaligned”, in the sense of being vulnerable to Goodhart’s Law.
A lot of moral philosophy is something like:
Gather our odd grab bag of heterogeneous, inconsistent moral intuitions
Try to find a coherent “theory” that encapsulates and generalizes these moral intuitions
Work through the consequences of the theory and modify it until you are willing to bite all the implied bullets.
The resulting ethical system often ends up having some super bizarre implications and usually requires specifying “free variables” that are (arguably) independent of our original moral intuitions.
In fact, I imagine that optimizing the universe according to my moral framework looks quite Goodhartian to many people.
Some examples of implications of my current moral framework:
I think that (a) personhood is preserved when a mind is moved into simulation, and (b) it's easier to control what's happening in a simulation, and consequently easier to fulfill a person's preferences. Therefore, it'd be ideal to upload as many people as possible. In fact, I'm not sure whether or not this should even be optional, given how horrendously inefficient the ratio of organic human atoms to "utilons" is.
I value future lives, so I think we have an ethical responsibility to create as many happy beings as we can, even at some cost to current beings.
I think that some beings are fundamentally capable of being happier than other beings. So, all else equal, we should prefer to create happier people. I think that parents should be forced to adhere to this when having kids.
I think that we should modify all animals so we can guarantee that they have zero consciousness, or otherwise guarantee that they don’t suffer (how do we deal with lions’ natural tendency to brutally kill gazelles?)
I think that people ought to do some limited amount of wire-heading (broadly increasing happiness independent of reality).
Complete self-determination/subjective “free-will” is both impossible and not desirable. SAI will be able to subtly, but meaningfully, guide humans down chosen paths because it can robustly predict the differential impact of seemingly minor conversational and environmental variations.
I’m sure there are many other examples.
I don’t think that my conclusions are wrong per se, but… my ethical system has some alien and potentially degenerate implications when optimized hard.
It’s also worth noting that although I stated those examples confidently (for rhetorical purposes), my stances on many of them depend on very specific details of my philosophy and have toggled back and forth many times.
No real call to action here, just some observations. Existing human ethical systems might look as exotic to the average person as some conclusions drawn by a kinda-aligned SAI.