FWIW it seems to me that EY did not carefully read your post, and missed your distinction between the human utility function merely being somewhere in the AI vs. being explicitly specified. Assuming you didn’t edit the post, your paragraph here
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
makes this clear enough. But my eyes sort of glazed over this part. Why? Quoting EY’s comment above:
We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI’s preferences has a lot of hidden complexity.
A lot of the other sentences in your post sound like things that would make sense to say if you didn’t understand this point, and that wouldn’t make sense to say if you did understand this point. EY’s point here still goes through even if you have the ethical-situations-answerer. I suspect that’s why others, and initially I, misread / projected onto your post, and why (I suspect) EY took your explicit distancing as not reflecting your actual understanding.
Unfortunately, I must say I actually did add that paragraph later to make my thesis clearer. However, the version that Eliezer, Nate, and Rob replied to did contain the following paragraph, which I think makes essentially the same point (i.e. that I am not merely referring to passive understanding, but rather to explicit specification):
I’m not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can’t access. This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”. That wouldn’t solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
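To make that distinction concrete, here is a minimal sketch of what "querying the model and using the answer as a human value function" could look like. This is only an illustration under assumed names: `query_llm` is a hypothetical stand-in for whatever model access you have, not a real API, and none of these names come from the post.

```python
# Hypothetical sketch: an LLM query wrapped as an explicit, legible value function.
# `query_llm` is an assumed placeholder for model access (e.g. a chat API); it is
# not a real library call.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a capable language model."""
    raise NotImplementedError("Plug in your model access here.")


def human_value_function(outcome_description: str) -> float:
    """Score an outcome by asking the model directly.

    This is the "explicit specification" reading: the values live in a
    function we can call on any outcome, not merely somewhere inside the
    model's weights.
    """
    prompt = (
        "On a scale from 0 (very bad) to 1 (very good), how good is the "
        "following outcome for humans? Reply with a single number.\n\n"
        + outcome_description
    )
    reply = query_llm(prompt)
    return float(reply.strip())
```

Whether such a function would actually be adequate is of course the point under dispute; the sketch is only meant to show what "legible and queryable" means here, as opposed to values that sit passively inside the model.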
Ok, thanks for clarifying that that paragraph was added later.
(My comments also apply to the paragraph that was in the original.)