I don’t know, soldiers obeying orders? Drinking alcohol under social pressure? Of course, it’s not clear corrigibility is an appropriate concept if you don’t think in terms of goals.
Signer
But then what can you conclude from this goal-directedness? “Generally” and “usually” are free parameters. Sometimes people are somewhat corrigible. If you don’t have a procedure for determining whether your situation is usual, you can only get heuristic-level reliability on your predictions. And then it’s not clear how useful the goal-directedness heuristic even is, relatively speaking: maybe it’s more useful to just remember that people don’t usually burn all their money in a pit, independently of their goals. And if there is a better frame, then maybe you shouldn’t think “obviously some minds are goal directed”.
obviously some minds are goal directed?
What do you mean by this? Some theoretical minds, sure. But otherwise it’s either false or leaves free parameters: no known mind maximizes anything non-trivial, because it’s uncomputable or at least too computationally costly to strictly maximize. And if you say that some mind is approximately goal directed, then it remains to be shown that the consequences of the strict theory survive this approximation.
The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of “we only get one shot at ASI” is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever.
Or it’s the writer’s fault and calling it “one shot” is just a bad choice of words, when its being correct depends on a specific decomposition into shots and “irretrievability” is better. Like, people are forced to say “there’ll be multiple first critical tries” to describe a situation where you can repeatedly fail to notice AI scheming. And it’s especially bad when you simultaneously endorse “you can’t try after ASI kills you” and “ASI that can actually kill you is an importantly different situation”. Irretrievability, distribution shift and their correlation should be argued independently. Otherwise people will go off topic like this:
But also, in a very ordinary way, there just isn’t a nonlethal way to test out lethal levels of superintelligence.
You can test lethal levels of nonsuper intelligence instead!
Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
But they don’t unpack to optimality being a real thing. No real entity actually optimizes anything, except maybe everything minimizes action. “It’s useful in economics” doesn’t mean you can just extrapolate it wherever.
I think people who build AI systems every day are “wildly miscalibrated” on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
What is supported by what? Is the claim that thinking about utility worked for economists, so everyone should think about utility, or that empirical research shows that anyone smart is trying to conquer the world, or what is the claim and what is the evidence?
It is all ungrounded philosophy without quantifying what actual theories match reality by how much.
Another double-counting: wanting people to be saved for altruistic reasons and wanting to personally do things that save people.
Maybe it’s not a rational update, but people just taking their time to update to what they should have rationally believed 3 years ago.
Fire is reducible.
And so are qualia. The only difference is that science hasn’t yet provided a useful reduction. But laws of physics still don’t say how you should reduce things. And reductions don’t preserve everything: fire can look continuous, but actually consist of atoms.
No it comes from the observation that our sensorium is not a picture of our brains?
You can also observe that fire is not a picture of atoms. Reductions are indirect, can have different precision and some parts of observations are just wrong. There are no observations that contradict future neurology predicting all your experiences more precisely than you can feel them now.
If Qualia are identified with microphysical properties, those properties need to be localised to solve the binding problem.
Again, there is no reason to directly identify qualia with microphysical properties. You don’t need to make atoms continuous to bind them to continuous-looking fire. The idea is to only identify the phenomenal nature of qualia with physical existence. After that science can figure out a specific useful model and just say “your observations of qualia are not sensitive enough to say anything about localization on nanometer scale” like it says in the case of continuous-looking fire.
I’m not saying that figuring out how the brain implements human experiences is a solved or uninteresting problem. It’s just not a Hard, philosophical problem. At least no more than in the case of fire.
It factors into localised parts.
Approximately localized. And even without quantum effects there are definitely relevant interactions on the macro scale. And gravity. And space. I just don’t get how the “physical human is not spatially extensive” objection makes sense. Of course, it doesn’t matter, because saying that qualia are spatially extensive is like saying that fire is continuous.
Why is there a binding problem for fire?
Because there is no fire in the ontology of modern physics and there are no laws of physics that say that some arrangement of atoms is fire. There are only extra-physical conventions that say that if atoms work approximately like fire you can say that fire reduces to atoms. That’s how reductionism works. It works the same way for observations: there are no physical laws that determine how precisely your measurement equipment must draw numbers for you to conclude your physical theory is correct. And it works the same way for qualia: there are no physical laws that say that some neural activity is your experience of blue.
Are you now saying that the binding comes from neurology?
Binding comes from a human desire to describe things in an approximate, useful way. Fundamentally, there is no binding between real physics and continuity of fire. And so the binding problem is an easy problem of scientifically describing a brain with enough precision that all pixels of your visual field are predictable from this description.
something is a WF doesn’t mean it is nonlocal or particularly spatially extensive, since WFs can bunch down to any finite size.
Sure.
Most of the electrons in the human body are localised to orbitals that are some nanometers across (But not localised within them).
But the WF of a human is spatially extensive enough.
our sensorium should look like a fine grained brain scan
Why not like a drawing of a head?
Anyway, the binding problem for qualia is no different from the binding problem for fire. There is just no reason to promote limits of human introspection into fundamental ontology, just like there is no reason why fire can’t look continuous, but actually consist of mostly empty space.
Oh, ok, I misunderstood you.
or you could have the qualia instead of that (monistic panpsychism)
Physics is monistic panpsychism: there are not just geometric-causal-numerical ingredients, there is also an implicit statement that the universe the equations describe has an intrinsic property of existence.
Yes, but why do you refuse to believe it? What’s your evidence that your experience of color is ontologically primitive? It’s just a baseless assumption.
Physicalists who aren’t thoroughgoing eliminativists or illusionists, are actually dualists.
Can you imagine believing in dualistic non-physical parts of your experience that you are not aware of?
They mean that (there is more chance that) training will produce obedient AI that will help governments become more totalitarian and will not effectively pursue some very alien goal.
For people who have color vision, I can state it more concretely: color exists in reality, it doesn’t exist in physics, therefore physics is incomplete in some way.
You don’t have enough evidence of this. Nothing about your experience of color contradicts it being neurons. Do you agree that you can have thoughts about your experience of color? Like “I saw a blue sky yesterday”. Do you agree that they can be more or less correct, like when you forgot that it was actually very cloudy all day yesterday? Do you agree that you can describe your experience more or less precisely? Do you agree that your experience has structure? When you say that “color” exists you mean something that works in specific ways. For example, it does not create blue-sky experiences on very cloudy days. And if you describe these ways precisely enough, you’ll get a description of neurons. What do you think a physical description of you describes, when it describes a difference between a state interpretable as you seeing a blue sky and a state interpretable as you seeing a cloudy sky?
Is it just that you refuse to believe that your experience has any parts you are not aware of?
I’m not a fan of platonism. Definitely not of traditional platonism, as some separate additional category in fundamental ontology. Looks like something human mathematicians would come up with to feel better about themselves. Even though it is outside-view reasoning, similar to the kind people use to dismiss panpsychism, I still don’t see what the point is, when you can just say that any instance of math working is a physical fact.
The mathematical universe is more likely, but I’m not even sure it is a simpler hypothesis than some other, not so mathy physics.
Assuming it, I can see how not having to worry about the existence of high-level abstractions can help. It’s just funny, because “but it IS some other territory” is a very overpowered argument. Causality gets weird, but platonists probably love acausal stuff, so whatever. Personally, in this scenario, I worry that the mathematical universe doesn’t give existence to some abstraction and so if you rely on this, you can still get zombies on some level. Probably it’s not so limited, but even then, are you supposed to be able to constrain the mathematical universe by thinking about abstractions in our world?
Again, this is all correct. Well, except level 6. But level 6 is hilarious.
A physicalist, if I understand correctly, could consistently claim that such an experiment is deluding the subject, essentially doing something like modifying the memory of the experience so that they inaccurately feel the same, when in fact there was a difference.
It’s all arbitrary ethics. You can already say that changing location deludes you. Suddenly starting to care about complexity is just letting your epistemology bleed into your values.
C1 wants to say that worlds which are structurally isomorphic are literally the same world.
I don’t think this is a typical or correct view, if you factor existence out of structure. People believe in reality. “Shut up and calculate” has a name precisely because it’s not a universal position. There is a physical difference between a real and a fictional chair, even if you describe them as having identical structure. It’s just that usually existence is implicit: physics doesn’t talk about fictional chairs. C1 doesn’t have an answer to “relations need relata” because “relations need relata” is correct.
And so is “blue is like a chair”.
They’re arguing that conscious experience of blue and red gives evidence of something that doesn’t purely fit the causal/functional role in the way a chair does.
Yeah, but they don’t have a strong argument. I’m not sure what a rigorous way is to show that the argument from conceivability of world B fails, if we accept the framework of conceivability arguments. Rules of counterfactual behavior are rules of physics and so worlds have different relations, maybe? But I don’t believe conceivability arguments are that rigorously justified in the first place. I accept them in the case of zombies, mostly because there is a broadly physicalist solution: zombies are different in that they don’t exist. But in the blue/red case you can conceive of a functionally identical chair that exists differently as much as you can conceive of spectrum inversion. You don’t even need to be unphysical about it: an antimatter chair from an antimatter-dominated world counterfactually annihilates if you bring it to our world.
And more importantly, like C1 says, parsimony: there is no need to think about different kinds of existence, when you can explain everything with one kind. You agree that if we grant an intrinsic property of existence, then third-person descriptions describe first-person experience as completely as they describe a chair? Because then neurons and atoms are just a more precise description of the same reality that you call “I’m seeing blue”. C2 doesn’t have evidence or arguments that say that “blue” is not neurons, if neurons (are a high-level description of reality that) intrinsically exist. But then all differences between blue and red are describable by relations (that are about things that exist) and so arguments about inverted spectrum should not change anything.
If you start to say that some “intrinsic property” is needed to realise the structure then C2 has an opening to claim this is the categorical protophenomenal property required to fix phenomenal character.
Well, there isn’t much that makes it “phenomenal”. Chairs also exist. And it’s not unphysical to say that things exist. It is supposed to feel acceptable to everyone by design^^. And if you accept it, all phenomenal structure (all differences between red and blue and all first-person descriptions) is as completely describable by relational physics as chairs. In the end a physicalist can say it’s not that consciousness maps to existence, it’s just that people confused consciousness with a different, perfectly physical concept of existence.
Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But visible misalignment being easy to detect and correlated with misaligned chain-of-thought doesn’t guarantee that training that eliminates visible misalignment and misaligned chain-of-thought results in a model that does things for the right reasons? The model can still learn unintended heuristics. And what’s the actual hypothesis about the model’s reasons when they appear to be right? That its learned reasoning algorithm is isomorphic to the reasoning algorithm of a helpful human who reads the same instructions, or what?
What misunderstanding?
“Mostly” allows for some people to read about corrigibility and think “yeah, I’ll do it”, or whatever you think would be a counterexample to goal directedness of humans.
Superhuman AI capable of takeover may still have (maybe intentional) cognitive limitations/whatever humans have—speed of convergence relative to takeover difficulty is still a free parameter.