I’ll preface this by saying that I don’t see why it’s a problem, for purposes of alignment, for human values to refer to non-existent entities. This should manifest as humans and their AIs wasting some time and energy trying to optimize for things that don’t exist, but this seems irrelevant to alignment. If the AI optimizes for the same things that don’t exist as humans do, it’s still aligned; it isn’t going to screw things up any worse than humans do.
But I think it’s more important to point out that you’re joining the same metaphysical goose chase that has made nonsense of Western philosophy since before Plato.
You need to distinguish between the beliefs and values a human has in their brain, and the beliefs and values they express to the external world in symbolic language. I think your analysis concerns only the latter. If that’s so, you’re digging up the old philosophical noumena / phenomena distinction, which itself refers to things that don’t exist (noumena).
Noumena are literally ghosts; “soul”, “spirit”, “ghost”, “nature”, “essence”, and “noumena” are, for practical purposes, synonyms in philosophical parlance. The ghost of a concept is the metaphysical entity which defines what assemblages in the world are and are not instances of that concept.
But at a fine enough level of detail, not only are there no ghosts, there are no automobiles or humans. The Buddhist and post-modernist objections to the idea that language can refer to the real world are that the referents of “automobiles” are not exactly, precisely, unambiguously, unchangingly, completely, reliably specified, in the way Plato and Aristotle thought words should be. I.e., the fact that your body gains and loses atoms all the time means, for these people, that you don’t “exist”.
Plato, Aristotle, Buddhists, and post-modernists all assumed that the only possible way to refer to the world is for noumena to exist, which they don’t. When you talk about “valuing the actual state of the world,” you’re indulging in the quest for complete and certain knowledge, which requires noumena to exist. You’re saying, in your own way, that knowing whether your values are satisfied or optimized requires access to what Kant called the noumenal world. You think that you need to be absolutely, provably correct when you tell an AI that one of two worlds is better. So those objections apply to your reasoning, which is why all of this seems to you to be a problem.
The general dissolution of this problem is to admit that language always has slack and error. Even direct sensory perception always has slack and error. The rationalist, symbolic approach to AI safety, in which you must specify values in a way that provably does not lead to catastrophic outcomes, is doomed to failure for these reasons, which are the same reasons that the rationalist, symbolic approach to AI was doomed to failure (as almost everyone now admits). These reasons include the fact that claims about the real world are inherently unprovable, which has been well-accepted by philosophers since Kant’s Critique of Pure Reason.
That’s why continental philosophy is batshit crazy today. They admitted that facts about the real world are unprovable, but still made the childish demand for absolute certainty about their beliefs. So, starting with Hegel, they invented new fantasy worlds for our physical world to depend on, all pretty much of the same type as Plato’s or Christianity’s, except instead of “Form” or “Spirit”, their fantasy worlds are founded on thought (Berkeley), sense perceptions (phenomenologists), “being” (Heidegger), music, or art.
The only possible approach to AI safety is one that depends not on proofs using symbolic representations, but on connectionist methods for linking mental concepts to the hugely complicated structures of correlations in sense perceptions which those concepts represent, as in deep learning. You could perhaps then construct statistical proofs that rely on the over-determination of mental concepts to show almost-certain convergence between the mental languages of two different intelligent agents operating in the same world. (More likely, the meanings which two agents give to the same words won’t necessarily converge, but the probability estimates they give to propositions expressed using those same words will.)
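To make the convergence claim a bit more concrete, here’s a toy sketch (entirely my own illustration, with made-up numbers): two agents with different architectures and random initializations learn the same world; their internal parameters aren’t even comparable, but the probability estimates they assign to the same test propositions end up agreeing closely.

```python
# Toy sketch (my illustration, not from the post): two "agents" fit different
# models to the same world, end up with incomparable internal representations,
# yet largely agree on probability estimates.
import numpy as np

rng = np.random.default_rng(0)

# A shared "world": propositions whose truth depends on 3 hidden features.
X = rng.normal(size=(5000, 3))
true_w = np.array([2.0, -1.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(5000) < p_true).astype(float)

def train_agent(seed, n_hidden):
    """Fit a small one-hidden-layer network by plain full-batch gradient descent."""
    r = np.random.default_rng(seed)
    W1 = r.normal(scale=0.5, size=(3, n_hidden))
    w2 = r.normal(scale=0.5, size=n_hidden)
    for _ in range(2000):
        h = np.tanh(X @ W1)                      # hidden "mental language", (N, n_hidden)
        p = 1.0 / (1.0 + np.exp(-(h @ w2)))      # probability estimates, (N,)
        g = (p - y) / len(y)                     # gradient of mean log-loss wrt the logits
        g_w2 = h.T @ g
        g_W1 = X.T @ (np.outer(g, w2) * (1.0 - h ** 2))
        w2 -= 1.0 * g_w2
        W1 -= 1.0 * g_W1
    return W1, w2

def predict(params, X_new):
    W1, w2 = params
    return 1.0 / (1.0 + np.exp(-(np.tanh(X_new @ W1) @ w2)))

agent_a = train_agent(seed=1, n_hidden=8)
agent_b = train_agent(seed=2, n_hidden=16)   # a different "mental language"

X_test = rng.normal(size=(1000, 3))
pa, pb = predict(agent_a, X_test), predict(agent_b, X_test)
print("correlation of the two agents' probability estimates:",
      round(np.corrcoef(pa, pb)[0, 1], 3))    # close to 1 in this toy setup
```

The point isn’t the particular model; it’s that agreement is measured at the level of probability estimates assigned to shared propositions, not at the level of internal representations.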
Fortunately, all mental concepts are over-determined. That is, we can’t learn concepts unless the sense data we’ve accumulated contains much more information than the concepts we learn from it. That comes automatically from what learning algorithms do: any algorithm which constructed concepts containing more information than was in the sense data would be a terrible, dysfunctional algorithm.
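A back-of-envelope sketch of that point, with purely illustrative numbers: compare the information in a stream of sense data against the information in the concepts a crude one-step learner extracts from it.

```python
# Back-of-envelope sketch (my own illustrative numbers): the sense data a
# learner sees carries far more information than the concepts it extracts,
# and must, since the concepts are a function of that data
# (data-processing inequality).
import numpy as np

rng = np.random.default_rng(0)

N, D, BITS = 10_000, 16, 8      # 10k percepts, 16 features, 8 bits per feature
K = 16                          # the learner compresses these into 16 "concepts"

data = rng.integers(0, 256, size=(N, D))

# A crude learner: take K random percepts as prototypes and label every
# percept with its nearest prototype (one step of k-means is enough here).
prototypes = data[rng.choice(N, size=K, replace=False)].astype(float)
dists = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
labels = dists.argmin(axis=1)
print("percepts assigned to each concept:", np.bincount(labels, minlength=K))

bits_in_sense_data = N * D * BITS    # 1,280,000 bits of raw percepts
bits_in_concepts = K * D * BITS      # 2,048 bits of learned prototypes
print(f"sense data: {bits_in_sense_data:,} bits")
print(f"learned concepts: {bits_in_concepts:,} bits "
      f"({bits_in_sense_data / bits_in_concepts:.0f}x less)")
```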
You are still not going to get a proof that two agents interpret all sentences exactly the same way. But you might be able to get a proof which shows that catastrophic divergence is likely to happen less than once in a hundred years, which would be good enough for now.
Perhaps what I’m saying will be more understandable if I talk about your case of ghosts. Whether or not ghosts “exist”, something exists in the brain of a human who says “ghost”. That something is a mental structure, which is either ultimately grounded in correlations between various sensory perceptions, or is ungrounded. So the real problem isn’t whether ghosts “exist”; it’s whether the concept “ghost” is grounded, meaning that the thinker defines ghosts in some way that relates them to correlations in sense perceptions. A person who thinks ghosts fly, moan, and are translucent white with fuzzy borders, has a grounded concept of ghost. A person who says “ghost” and means “soul” has an ungrounded concept of ghost.
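One strict way you might formalize that distinction, as a toy sketch (the definitions and examples are mine):

```python
# Toy sketch (definitions are mine, for illustration): a concept is "grounded"
# if its chain of definitions bottoms out in sensory primitives, and
# "ungrounded" if it is only defined in terms of other non-sensory concepts.

SENSORY_PRIMITIVES = {"flies", "moans", "translucent", "white", "fuzzy-border"}

# Each concept maps to the concepts/features it is defined in terms of.
DEFINITIONS = {
    "ghost_a": {"flies", "moans", "translucent", "white", "fuzzy-border"},
    "ghost_b": {"soul"},            # "ghost" used to mean "soul"
    "soul":    {"essence"},
    "essence": {"soul"},            # circular, never reaches the senses
}

def is_grounded(concept, seen=None):
    """True iff every chain of definitions reaches sensory primitives."""
    if concept in SENSORY_PRIMITIVES:
        return True
    seen = set() if seen is None else seen
    if concept in seen or concept not in DEFINITIONS:
        return False                # circular definition, or no definition at all
    seen.add(concept)
    return all(is_grounded(part, seen) for part in DEFINITIONS[concept])

print(is_grounded("ghost_a"))   # True  -- defined by perceptible correlates
print(is_grounded("ghost_b"))   # False -- bottoms out in the "soul"/"essence" circle
```

Under this toy rule a concept counts as grounded only if every chain of definitions reaches the senses; a laxer rule might require only some chain to do so. Either way, the question being asked is about grounding, not existence.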
Ungrounded concepts are a kind of noise or error in a representational system. Ungrounded concepts give rise to other ungrounded concepts, as “soul” gave rise to things like “purity”, “perfection”, and “holiness”. I think it highly probable that grounded concepts suppress ungrounded concepts, because grounded concepts usually provide evidence for the correctness of other grounded concepts. So sane humans using statistical proofs probably don’t have to worry much about whether every last concept of theirs is grounded. But as the number of ungrounded concepts increases, there is a tipping point beyond which the ungrounded concepts can be forged into a self-consistent but psychotic system such as Platonism, Catholicism, or post-modernism, at which point they suppress the grounded concepts.
Sorry that I’m not taking the time to express these things clearly. I don’t have the time today, but I thought it was important to point out that this post is diving back into the 19th-century continental grappling with Kant, with the same basic presupposition that led 19th-century continental philosophers to madness. TL;DR: AI safety can’t rely on proving statements made in human or other symbolic languages to be True or False, nor on having complete knowledge about the world.
What do you mean by a “goodhearting” problem, and why is it a lossy-compression problem? Are you using “goodhearting” to refer to Goodhart’s Law?