Here’s a rather out-there hypothesis.
I’m sure many LessWrong members have had the experience of arguing some point piecemeal, where they’ve managed to get weak agreement on every piece of the argument, but as soon as they step back and trace the chain from start to end, their conversation partner ends up less than convinced. In this sense, in humans even implication isn’t transitive. Mathematics offers some fun tales (whose sources I’m struggling to find) of pre-mathematical societies where people were unwilling to trade two of A for two of B, yet happy to trade A for B twice, among other such oddities.
It’s plausible to me that the need for consistent models of the world only comes about as intelligence grows and allows people to arbitrage value between the different parts of their thoughts. Early humans, and their lineage before that, weren’t all that smart, so it makes sense that evolution didn’t force their beliefs to be especially consistent—as long as a belief was locally valid, it worked. As intelligence evolved, certain issues would occasionally crop up, but rather than fixing them in a fundamental way, which would be hard, minor kludges were put in place.
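As a toy sketch of the kind of arbitrage I mean (items, prices, and fee all made up for illustration), an agent with locally sensible but cyclic pairwise preferences can be walked around the cycle and drained of resources:

```python
# A toy "preference pump": an agent whose pairwise preferences form a cycle
# will happily pay a small fee for each individual swap, and ends up back
# where it started, just poorer. All items and numbers are made up.
preferences = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}  # preferred item of each pair

def run_pump(item="A", money=10, fee=1):
    trades = 0
    while money >= fee:
        # Offer the one swap the agent currently finds locally appealing.
        for pair, preferred in preferences.items():
            if item in pair and preferred != item:
                item, money, trades = preferred, money - fee, trades + 1
                break
    return item, money, trades

print(run_pump())  # ('B', 0, 10): still holding something in the cycle, all the money gone
```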
For example, I don’t like being exploited. If someone leads me around a pump, I’m going to value the end state less than its ‘intrinsic’ value. You can see this behaviour a lot in discussions of trolley problem scenarios: people object to having these thoughts traded off against each other, to a degree that often overshadows the underlying dilemma. Similarly, I find gambling around opinions intrinsically uncomfortable, and notice that fairly frequently people object to me asking them to quantify their claims more precisely, even in cases where I’m not staking an opposing claim. Finally, since some people are better at sounding convincing than I am, it’s completely reasonable to reject some things more broadly because of the possibility that the argument is an exploit—this is epistemic learned helplessness, sans ‘learned’.
There are other explanations for all the above, so this is hardly bulletproof, but I think there is merit in considering evolved defences to exploitation that don’t require being exploit-free, and in asking whether there is any benefit to something of this form. Behaviours that avoid, and back away from, these exploits seem fairly obvious places to look. One could imagine (sketchily, non-endorsingly) an FAI built on these principles, so that even without a bulletproof utility function, the AI would still avoid exploiting itself.
Most of the complexity in human society is unnecessary to merely outperform the competition. The exploits that prehistoric humans found were readily available; it’s just that evolution could only find them by inventing a better optimizer, rather than getting there directly.
Crafting spears and other weapons is a simple example. The process to make them could be instinctual, and very little intellect is needed. Similar comments apply to clothing and cooking. If they were evolved behaviours, we might even expect parts of these weapons or tools to grow from the animal itself—you might imagine a dedicated role for one of the members of a group, who grows blades or pieces of armour that others can use as needed.
One could imagine plants that grow symbiotically with some mobile species that farms them and keeps them healthy in ways the plant itself is not able to do (e.g. weeding), and in return provides nutrition and shelter, which could include enclosed walling over a sizable area.
One could imagine prey, like rabbits, becoming venomous. When resistance starts to form in predators, they could switch to a different venom for a thousand generations before switching back. In fact, you could imagine such venomous rabbits aggressively trying to drive predators extinct before they had the chance to develop resistance; a short-term cost for long-term prosperity.
The overall point is that evolution does not have the insight to get around optimization barriers. Consider brood parasites, where birds lay eggs in other species’ nests. It is hypothesized that a major reason this behaviour succeeds is retaliation against hosts that eject the parasite’s eggs. Clearly these victim species would be better off if they just wiped the parasites off the face of the earth, as long as they survived the one-time increase in retaliation, but evolutionary pressure instead resulted in them evolving complicity.
And once you have one form of communication, the pressure to develop a second is almost none.
I agree with almost all of your post, but not this, given the huge number of channels of communication that animals have. Sound, sight, smell and touch are all important bidirectional communication channels between many social animals.
There are lots of simple things that organisms could do that would make them wildly more successful. The success of human society is a good demonstration of how very low-complexity systems and behaviours can drive your competition extinct, magnify available resources, and more, and the vast majority of these could in principle easily be coded into the genome.
However, evolution does not make judgements about the end result. The question is whether there is a path of incremental successes leading to your desired result. The recurrent laryngeal nerve is a good demonstration that even basic impediments won’t be worked around if you can’t get there step by step with appropriate evolutionary pressure. Ultimately there seems to be no impetus for a half-baked neuron tentacle, and a lot of cost and risk, so that will probably never be the path to such organisms.
There are many examples of fairly direct inter-organism communication, like RNA transfer between organisms, and to the extent that cells think in chemicals, the fact that they readily share their chemical environment is a form of this kind of communication. I’m not aware of anything similarly direct at larger scales, such as directly between the neurons of different organisms.
I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense.
It’s possible that we’ve been struggling with this conversation because I’ve been failing to grasp just how radically different your opinions are to mine.
Imagine your generic outside observer was superintelligent, and understood (through pure analysis) qualia and all the corresponding mysteries of the mind. Would you then still say this outside observer would not consider us to have any specific set of preferences, in an objective sense, where “preferences” takes on its colloquial meaning?
If not, why? I think my stance is obvious; where preferences colloquially means approximately “a greater liking for one alternative over another or others”, all I have to claim is that there is an objective sense in which I like things, which is simple because there’s an objective sense in which I have that emotional state and internal stance.
“Agent A has preferences R” is not a fact about the world. It is a stance about A, or an interpretation of A. A stance or an interpretation that we choose to take, for some purpose or reason.
I find it hard to imagine that you’re actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense. I do have a preference for a happy and meaningful life over a life of pure agony. Anyone who thinks I do not is factually wrong about the state of the world.
Then there is a sense in which the formal models we build of these systems are purely interpretative. If “preferences R” refers to a function returning a real number, then for sure this is not some facet of the real world, and there are many such seemingly-different models for any agent. Here again I believe we agree.
But we seem not to be agreeing at the next step, with the preference stance. Here I claim your goal should not be to maximise the function “preferences R”, whose precise values are irrelevant and independent of the territory, but to maximise the actual human preferences.
Consider measuring a simpler system, temperature, and projecting this onto some number. Clearly, depending on how you do this projection, you can end up at any number for a given temperature. Even with a simplicity prior, higher temperatures can correspond to larger numbers or smaller numbers in the projection, with pretty much equal plausibility. So even in this simplified situation, where we can agree that some temperatures are objectively higher than others, you cannot reliably maximize temperature by maximizing its projection.
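As a minimal sketch of this (the temperatures and the two projections below are made up for illustration), two equally simple projections disagree about which direction the bigger number points, so maximising the projected number is not the same as maximising temperature:

```python
# Two equally simple projections of temperature onto a real number. An
# optimizer that only sees the projected value cannot tell which direction
# is actually hotter.
temperatures = [10.0, 20.0, 30.0, 40.0]

project_up = lambda t: t     # hotter -> larger number
project_down = lambda t: -t  # hotter -> smaller number; no more complex

print(max(temperatures, key=project_up))    # 40.0: maximising this projection maximises temperature
print(max(temperatures, key=project_down))  # 10.0: maximising this projection minimises temperature
```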
Your preference function is a projection. The arbitrary choices you have to make to build this function are not assumptions about the world, they are choices about the model. When you prove that you have many models of human preference, you are not proving that preference is entirely subjective.
That’s why, when you use empathy to figure out someone’s goals and rationality, this also allows you to better predict them. But this is a fact about you (and me), not about the world. Just as “Thor is angry” is actually much more complex than electromagnetism, our prediction of other people via our empathy machine is simpler for us to do—but is actually more complex for an agent that doesn’t already have this empathy machinery to draw on.
This Thor analogy is… illuminating about the differences in our perspectives. An angry Thor is a much more complex hypothesis only up until the point you see an actual Thor in the sky hurling spears of lightning. Then it becomes the only reasonable conclusion, because although a brain seems like it involves a lot of assumptions, it ultimately requires many fewer assumptions (for pre-industrial Norse people) than that same amount of coincidence would.
This is the point I am making with people. If your computer models people as arbitrary, randomly sampled programs, of course it will struggle to distinguish human behaviour from its inverted counterpart. However, people are neither mutually independent nor arbitrary computing systems. Arguing that a physical person optimizing competently for a good outcome and a physical person optimizing nega-competently for a bad outcome are similarly simple has to overcome at least two hurdles:
1. We seem to know things about which mental states are good and which mental states are bad. This implies there is objective knowledge that can be learnt about it.
2. You would need to extend your arguments about mathematical functions into the real world. I don’t know how this could be approached.
I have a hard time believing that in another world people think the qualia corresponding to our suffering are good and the qualia corresponding to our happiness are bad, and if that were so, it would strike me as a much bigger deal than anything else you are saying.
One of us is missing what the other is saying. I’m honestly not sure what argument you are putting forth here.
I agree that preference/reward is an interpretation (the terms I used were map and territory). I agree that (p,R) and (-p,-R) are approximately equally complex. I do not agree that complexity necessarily carries over between the map and the territory. This means that although the model might be a strong analogy when talking about behaviour, it is sketchy to use it as a model for the complexity of behaviour.
But that doesn’t detract from the main point: that simplicity, on its own, is not sufficient to resolve the issue.
It kind of does. You have shown that simplicity cannot distinguish (p, R) from (-p, -R), but you have not shown that simplicity cannot distinguish a physical person optimizing competently for a good outcome from a physical person optimizing nega-competently for a bad outcome.
If it seems unreasonable for there to be a difference, consider a similar map-territory distinction: a height map versus the actual mountain. An optimization procedure that does gradient descent on a height map is the same complexity, or nearabouts, as one that does gradient ascent on the height map’s inverse. However, a system that physically descends the actual mountain can be much simpler than one that somehow ascends the mountain’s inverse. Since negative mental experiences are somehow qualitatively different to positive ones, it would not surprise me much if they did in fact effect a similar asymmetry here.
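To make the map-level half of that concrete (a minimal sketch; the one-dimensional ‘height map’ and step size are made up), gradient descent on h and gradient ascent on -h perform literally the same computation:

```python
# Gradient descent on a height map h and gradient ascent on its negation -h
# take exactly the same steps, so at the map level they are equally complex.
def grad(f, x, eps=1e-6):
    # simple central-difference numerical gradient
    return (f(x + eps) - f(x - eps)) / (2 * eps)

h = lambda x: (x - 3.0) ** 2  # a 1-D "height map" with a valley at x = 3
neg_h = lambda x: -h(x)

def descend(f, x, lr=0.1, steps=50):
    for _ in range(steps):
        x -= lr * grad(f, x)
    return x

def ascend(f, x, lr=0.1, steps=50):
    for _ in range(steps):
        x += lr * grad(f, x)
    return x

print(descend(h, 0.0))     # ~3.0
print(ascend(neg_h, 0.0))  # ~3.0: same trajectory, same endpoint
```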
Obviously misery would be avoided because it’s bad, not the other way around.
As mentioned, this isn’t obvious to me, so I’d be interested in your reasoning. Why should evolution build systems that want to avoid intrinsically bad mental states?
We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.
Yes, my point here was twofold. One, the formalism used in the paper does not seem to be deeply meaningful, so it would be best to look for some other angle of attack. Two, given the claim about intrinsic badness, the programmer is embedding domain knowledge (about conscious states), not unlearnable assumptions. A computer system would fail to learn this because qualia are a hard problem, not because the knowledge is unlearnable in principle. This makes it asymmetric and circumventable in a way that the no-free-lunch theorem is not.
Pushing q towards 1 might be a disaster
If I consider satisfaction of my preferences to be a disaster, in what sense can I realistically call them my preferences? It feels like you’re more caught up on the difficulty of extrapolating these preferences outside of their standard operation, but that seems like a rather different issue.
Fair warning, the following is pretty sketchy and I wouldn’t bet I’d stick with it if I thought a bit longer.
Imagine a simple computer running a simple chess-playing program. The program uses purely integer computation, except that it calculates its reward values and runs minimax over them in floating point. The search looks for the move that maximizes the outcome, which corresponds to a win.
This, if I understand your parlance, is ‘rational’ behaviour.
Now consider that the reward is negated, and the planner instead looks for the move that minimizes the outcome.
This, if I understand your parlance, is ‘anti-rational’ behaviour.
Now consider that this anti-rational program is run on a machine where floating-point values encoded with a sign bit of ‘1’ represent positive numbers and those with a sign bit of ‘0’ represent negative numbers—the opposite of the standard encoding.
It’s the same ‘anti-rational’ program, but exactly the same wires are lit up in the same pattern on this hardware as with the ‘rational’ program on the original hardware.
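A toy version of that equivalence (not real chess or real hardware; the move scores are made up): negate the reward and swap max for min, and the chosen move is identical:

```python
# Made-up "reward" R for each candidate move in some position.
move_scores = {"e4": 0.3, "d4": 0.5, "Nf3": 0.2}

# The 'rational' planner maximizes R; the 'anti-rational' planner minimizes -R.
rational_choice = max(move_scores, key=lambda m: move_scores[m])
anti_rational_choice = min(move_scores, key=lambda m: -move_scores[m])

print(rational_choice, anti_rational_choice)  # d4 d4: the same move either way
```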
In what sense can you say the difference between rationality and anti-rationality exists in the program (or in humans) at all, rather than in the model of them, when the same wires are both rational and anti-rational? I believe the same dilemma holds for indifferent planners. It doesn’t seem like reward functions of the type your paper talks about are a real thing, at least in a sense independent of interpretation, so it makes sense that you struggle to distinguish them when they aren’t there to be distinguished.
I am tempted to base an argument on the claim that misery is avoided because it’s bad, rather than being bad because it’s avoided. If true, this shortcuts a lot of your concern: reward functions exist only in the map, where numbers and abstract symbols can be flipped arbitrarily, but in the physical world these good and bad states have an intrinsic quality to them and can be distinguished meaningfully. Thus the question is not how to distinguish indistinguishable reward functions, but how to understand this aspect of qualitative experience. Then, presumably, if a computer could understand what the experience of unhappiness is like, it would not have to assume our preferences.
This doesn’t help solve the mystery. Why couldn’t a species evolve to maximise its negative internal emotional states? We can’t reasonably have gotten preference and optimization lined up by pure coincidence, so there must be a reason. But it seems like a more reasonable stance to shove the question off into the ineffable mysteries of qualia than to conflate it with a formalism that seems necessarily independent of the thing we’re trying to measure.
I believe I understood this metaphor. However, it seems to me this isn’t a good place to be, since I predict the metaphor is only useful to ground discussion about the thing that’s actually taking place. It is that second step that hasn’t worked.
Let’s flip this around. How do you know when someone is Looking? Is there a way to do so based on external behaviours? What is your equivalent of the following?
“I’m watching you stare at your phone. If you were Looking, your head would be up and your eyes would be pointed at me.”
You give a good example with the hair clipper, but I don’t know how much, if at all, that relates to Looking. If it is closely related I have a few follow-up questions that probably get to the crux of the issue I specifically am stuck on.
The exercise in falsification refers to Conor’s last sentence, only no longer applied specifically to him.
I’m wondering how you would falsify the claim (that I predict you will make and be justified in making) that I don’t get it.
When I say I am confused about what I am meant to be confused about, I mean that I’m failing to identify with Alex. He at least has a command he knows he cannot follow (Look above that! / That’s the top.), whereas I am stuck in the realm of unknown unknowns.
Your paragraph on the “it” from your kenshō is a much closer description of how I currently feel than its inverse is; I don’t understand what it would mean for this claim to be untrue, except in the sense that “it not being okay” accurately describes external reality. But that feels like it falls into the same trap your bullet points are said to fall into, only in the opposite direction.
Your later post about the benefits does this more clearly; with the definite exception of the point about energy, and the possible exception of the last, the other points seem like oddly accurate representations of the difference between me and the average person. But I don’t think I am enlightened.
So, on a concrete level, this comes through as the question: how would you differentiate someone who was born enlightened from someone who was not, but who is perhaps mistakenly labelling a shallow surface imitation as the real thing?
I would be interested in how you would falsify it regardless. I am confused about what I am meant to be confused about (what does it mean for it to not be okay?) and I suspect the exercise would remedy that.
I am a very strong satisficer, in direct conflict with my moral system, which would rather I maximise, so I live under the general understanding that I’m very far from my ideal.
I formed a similar argument around vegetarianism; I predicted that it would be easier for me to draw a hard line than to reconsider that line on a case-by-case basis. Rational me is more than capable of distinguishing between lobster and cow, but there is a lot of power in being able to tell myself to just eat the things with the label.
This is an extreme overapproximation but, given the moral stakes and my general unreliability, the successful results seem sufficient justification.