Clearly a squiggle-maximizer would not be an average squigglean
Why??? Being an expected squiggle maximizer literally means that you implement the policy that produces the maximum average number of squiggles across the multiverse.
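To make that concrete, here is a minimal sketch of what "implement the policy that produces the maximum average number of squiggles across the multiverse" cashes out to; `policies`, `worlds`, and `squiggle_count` are hypothetical names for illustration, not anything from the original discussion:

```python
import statistics

def choose_policy(policies, worlds, squiggle_count):
    """Toy expected squiggle maximizer: pick the policy whose average
    squiggle count across possible worlds ("the multiverse") is highest."""
    return max(
        policies,
        key=lambda policy: statistics.mean(
            squiggle_count(policy, world) for world in worlds
        ),
    )
```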
Definition given in the post:
I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human output, and that in outlying environmental conditions, very different from the training environment, it will behave very differently.
I think my example counts.
I wouldn't say that's exactly the best argument, but for example:
I mean spiders.
You can mention Portia, which can emulate mammalian predators' behavior using a much smaller brain.
Isn't counterfactual mugging (including the logical variant) just a prediction of "would you bet your money on this question"? Betting itself requires updatelessness: if you don't predictably pay after losing a bet, nobody will offer you bets in the first place.
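For concreteness, a minimal sketch of the "bet" framing, using the standard illustrative payoffs for counterfactual mugging (pay $100 on the losing branch, receive $10,000 on the winning branch if the predictor expects you to pay); these numbers are the usual textbook example, not anything specific to this comment:

```python
def expected_value(pays_when_asked: bool) -> float:
    """EV of a policy for counterfactual mugging, evaluated before the coin flip."""
    p_heads = 0.5
    prize = 10_000 if pays_when_asked else 0   # predictor rewards the paying policy on heads
    cost = -100 if pays_when_asked else 0      # what you actually hand over on tails
    return p_heads * prize + (1 - p_heads) * cost

print(expected_value(True), expected_value(False))  # 4950.0 vs 0.0: take the "bet" in advance
```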
"Lesson overall" can contain idiosyncratic facts that you can learn only if you run into the problem and try to solve it; you can't know them in advance (assuming you are human and not AIXI). But you can ask yourself "how would someone with a better decision-making algorithm solve this problem, given the same information I had before I tried to solve it?" and update your decision-making algorithm accordingly.
Apparently, the dialogue is happening in an inverted world: Ponzi schemes have never happened there, and everybody agrees on the problem of AI x-risk.
It's called "don't build it". Once you have something to delete, things can get complicated.
I think the difference between what you are describing and what is meant here is captured in this comment:
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
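As a quick check of the "negative EV" point in the quote, here is the arithmetic for a single-number bet on an American wheel (38 slots, 35-to-1 payout); the specific numbers are standard roulette rules, not from the quoted comment:

```python
p_win = 1 / 38                                  # one winning slot out of 38
ev_per_dollar = p_win * 35 - (1 - p_win) * 1    # win 35, or lose the stake
print(ev_per_dollar)                            # about -0.053 per dollar, whichever number you pick
```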
Do you mean "If EY were good enough, we would have known this trick many years ago"?
Why do you highlight status among a bazillion other things that generalized too, like romantic love, curiosity, and altruism?
I think the reason is that LLMs are not (pre)trained to do any particular practical task; they are "just trained to predict text". "Pre" signifies that the LLM is a "raw product", not yet suitable for consumers.
The problem is not that you can "just meditate and come to good conclusions"; the problem is that "technical knowledge about actual machine learning results" doesn't seem like a good path either.
Like, we can extract from a NN trained to do modular addition the fact that it performs a Fourier transform, because we clearly know what a Fourier transform is. But I don't see any clear path to extracting from a neural network the fact that its output is both useful and safe, because we don't have any practical operationalization of what "useful and safe" is. If we had a solution to the MIRI problem "which program, run on an infinitely large computer, produces an aligned outcome", we could try to understand how well the NN approximates that program using the aforementioned technical knowledge, and have substantial hope, for example.
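For reference, the "Fourier transform" fact is concrete enough to demo directly. Here is a minimal sketch of the rotate-the-phases algorithm that interpretability work reports finding inside modular-addition networks; the modulus and the single frequency below are arbitrary illustrative choices (trained networks spread the computation over several frequencies):

```python
import numpy as np

p = 97   # prime modulus, as in typical modular-addition (grokking) setups
k = 5    # one Fourier frequency, coprime with p

def mod_add_via_fourier(x: int, y: int) -> int:
    # Encode x and y as angles at frequency k; cos(a + b - c) peaks exactly
    # when c == (x + y) mod p, since p is prime and gcd(k, p) == 1.
    a = 2 * np.pi * k * x / p
    b = 2 * np.pi * k * y / p
    c = 2 * np.pi * k * np.arange(p) / p
    logits = np.cos(a + b - c)
    return int(np.argmax(logits))

# Sanity check: the phase trick reproduces modular addition on all inputs.
assert all(mod_add_via_fourier(x, y) == (x + y) % p
           for x in range(p) for y in range(p))
```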
I think there is nothing surprising in a small community of nerds writing in their spare time having blind spots, but it is surprising when a large professional community has such blind spots.
I would be more interested in a list of sources, because I googled "targeted NIR interference therapy IQ" and the first result in Google is your post.
After thinking for a while, I decided that it's better to describe the level of capability not as "capable of killing you" but as "lethal-by-default output". I.e.:
If an ASI builds nanotech that self-replicates in a wide range of environments and doesn't put in specific protections against it turning humans into gray goo, you are dead by default;
If an ASI optimizes the economy for +1000% productivity without specific care for humans, everyone dies;
If an ASI builds a Dyson sphere without specific care for humans, see above.
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside its cognitive process. Even if such an ASI doesn't pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the possibility of destroying a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it's capable of ejecting several von Neumann probes that can strike before we come up with a defense, or of sending radio signals carrying computer viruses, harmful memes, or copies of the ASI. But I think that if you have an unhackable simulation indistinguishable from the real world, and you yourself are somehow unhackable by the ASI, you can eventually align it with simple methods from modern prosaic alignment. The problem is that you can't say in advance which kind of finetuning you need, because you need it to generalize to untested domains, and you need to get that right in advance.
[One outcome with property X involves cognitive structures Y]
Do we know any other outcomes?
My evidence is how it is actually happening.
If you look at actual alignment development and ask yourself "what do I see at the empirical level?", you'll get this scenario:
We reach a new level of capabilities;
We get a new type of alignment failure;
If this alignment failure doesn't kill everyone, we can fix it even with very dumb methods, like "RLHF against the failure outputs", but that doesn't tell us anything about the kill-everyone level of capabilities.
I.e., I don't expect AutoGPT to kill anyone, because AutoGPT is certainly not capable of doing this. But I expect that AutoGPT has had a bunch of failures that were unpredictable in advance.
Examples:
What happened to ChatGPT on release.
What happened to ChatGPT in a slightly unusual environment, despite all its alignment training.
By "squiggle maximizer" I mean exactly "a maximizer of the number of physical objects such that the function is_squiggle returns True on the CIF file of their structure".
We can have different objects of value. For example, you can value "the probability that if an object in the multiverse is a squiggle, it is high-quality". Here, yes, you shouldn't create additional low-quality squiggles. But I don't see anything incoherent here; it's just a different utility function?
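A minimal sketch of the contrast, with hypothetical predicates `is_squiggle` and `is_high_quality` and a "world" that is just a list of objects: the first utility function counts squiggles, the second is the conditional-quality one described above, and they reward different actions without either being incoherent.

```python
def u_count(world, is_squiggle):
    # "Number of squiggles" maximizer: every extra squiggle adds value, however low-quality.
    return sum(is_squiggle(obj) for obj in world)

def u_conditional_quality(world, is_squiggle, is_high_quality):
    # "If it's a squiggle, it should be high-quality": adding low-quality squiggles lowers utility.
    squiggles = [obj for obj in world if is_squiggle(obj)]
    if not squiggles:
        return 0.0
    return sum(is_high_quality(s) for s in squiggles) / len(squiggles)
```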