Come to think of it, zebras are the closest thing we have to such adversarially colored animals. Imagine if they also flashed at 17 Hz, the optimal epilepsy-inducing frequency according to this paper: https://onlinelibrary.wiley.com/doi/10.1111/j.1528-1167.2005.31405.x
Well, both, but mostly the issue of it being somewhat evil. But it can probably be good even from a strategic, human-focused view, by giving assurances to agents who would otherwise be adversarial to us. It’s not that clear to me that it’s a really good strategy, because it kind of incentivizes threats, where an agent tries to find more destructive options in order to get a bigger reward for surrendering. Also, couldn’t it just try the destructive option first, if any of the failure modes are safe for it, and only then fall back to cooperating? Sounds difficult to get right.
And to be clear, it’s a hard problem anyway; even without this explicit thinking, this stuff is churning in the background, or will be. It’s a really general issue. Check this writeup, I mostly agree with everything there:
https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=y4mLnpAvbcBbW4psB
It’s really hard for humans to match the style / presentation / language without putting a lot of work into understanding the target of the comment. LLMs are (right now) inherently worse at doing the understanding, at coming up with things worth saying, and at being calibrated about when to be critical, AND they are a lot better at just imitating the style.
This just invalidates some side signals humans habitually use on one another.
This should probably only be attempted with a clear and prominent warning that it’s an LLM-authored comment. Because LLMs are good at matching style without matching the content, this could end up exploiting readers’ heuristics that are calibrated only for human levels of honesty / reliability / non-bullshitting.
Also check this comment about how conditioning on the karma score can give you hallucinated strong evidence:
Thoughts about what kinds of virtues are relevant in the context of LLMs.
Suppose you are writing a simulation. You keep optimizing it, hardcoding some stuff, handling different cases more efficiently, and so on. One day your simulation becomes efficient enough that you can run a big enough grid for long enough, and life develops. Then intelligent life. Then they try to figure out the physics of their universe, and they succeed! But, oh wait, their description is extremely short yet completely computationally intractable.
Can you say that they have already figured out what kind of universe they are in, or should you wait until they discover another million lines of optimization code for it? Should you put up giant sparkling letters saying “congratulations, you figured it out”, or wait for a more efficient formulation?
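A tiny toy illustration of that gap, purely my own sketch (the function and names are made up for this example): the same thing can have a very short, correct definition that is computationally hopeless, and a much longer optimized formulation that computes exactly the same answers.

```python
# Hypothetical illustration: a short "law" that is intractable to run directly,
# versus a longer, optimized formulation of the exact same function.

def fib_short(n: int) -> int:
    # Two-line definition: correct, but exponential time; hopeless for large n.
    return n if n < 2 else fib_short(n - 1) + fib_short(n - 2)

def fib_optimized(n: int) -> int:
    # Same function, with the "extra lines of optimization": linear time.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert fib_short(20) == fib_optimized(20)  # they agree where both are feasible
print(fib_optimized(200))                  # only the optimized form gets this far
```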
The alternative is to pit people against each other in competitive games, 1 on 1 or in teams. I don’t think the feeling you get from such games is consistent with “being competent doesn’t feel like being competent, it feels like the thing just being really easy”, probably mainly because of skill-based matchmaking: there are always opponents who pose a real challenge.
Hmm, maybe such games need some more long-tail probabilistic matchmaking, so you sometimes feel the difference. Or maybe variable team sizes, with many incompetent players versus a few competent ones, to get more of a “doomguy” feeling.
Completely agree. It’s more like a utility function for a really weird, inhuman kind of agent. That agent finds it obvious that if you had a chance to painlessly kill all humans and replace them with aliens who are 50% happier and 50% more numerous, it would be a wonderful and exciting opportunity. Like, it’s hard to overstate how weird utilitarianism is. And this agent will find it really painful and regretful to be confined by strategic considerations of “the humans would fight you really hard, so you should promise not to do it”. Whereas humans find that a relief? Or something.
Utilitarianism indeed is just a very crude proxy.
I really like that description! I think the core problem here can be summarized as: “By accidentally reinforcing for goal A, then for goal B, you can create an A-wanter that then spoofs your goal-B reinforcement and goes on taking A-aligned actions.” It can even happen just randomly, purely from the ordering of the situations/problems you present it with in training, I think.
I think this might require some sort of internalization of reward, or a model of the training setup. And maybe self-location, like how the world looks with the model embedded in it. It could also involve detecting the distinction between “situation made up solely for training”, “deployment that will end up in training”, and “unrewarded deployment”. Also, maybe this story could be added to Step 3:
“The model initially had a guess about the objective, which was useful for a long time but eventually got falsified. Instead of discarding it, the model adopted it as a goal and became deceptive.”
[edit]
Also, it kind of ignores that the RL signal is quite weak; the model can learn something like “to go from A to B you need to jiggle in this random pattern and then take 5 steps left and 3 forward” instead of “take 5 steps left and 3 forward”, and maybe it works like that for goals too. So, when AIs are doing a lot of actual work (Step 5), they could saturate the actually useful goals and then spend all the energy in the solar system on dumb jiggling.
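A minimal toy sketch of that weak-signal point (my own illustration, not from the post): with a sparse reward and no step cost, a policy that jiggles pointlessly before walking to the goal earns exactly the same return as the direct policy, so the reward alone never pushes against the jiggling.

```python
# Toy example (hypothetical): sparse-reward gridworld returns for a direct
# policy versus one that adds pointless jiggling before doing the same thing.

def episode_return(actions, goal=(5, 3), step_penalty=0.0):
    """Total return for a fixed action sequence on an open grid, starting at (0, 0)."""
    x = y = 0
    total = 0.0
    for dx, dy in actions:
        x, y = x + dx, y + dy
        total -= step_penalty
    if (x, y) == goal:
        total += 1.0  # reward arrives only at the goal
    return total

direct = [(1, 0)] * 5 + [(0, 1)] * 3              # "5 steps left and 3 forward"
jiggly = [(1, 0), (-1, 0)] * 10 + direct          # random-looking jiggle first

print(episode_return(direct))                     # 1.0
print(episode_return(jiggly))                     # 1.0 -> indistinguishable
print(episode_return(jiggly, step_penalty=0.01))  # lower once steps cost something
```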
I think it might be Yudkowsky’s actual position? Like, if you summarize it really hard.
I think the thing with talent is that it’s a useful and straightforward signal of quality you can obtain without investing a whole lot of resources into evaluation/reading/research.
Same with awards, recommendations from famous people, popularity scores and so on.
And it’s probably reasonable to feel a bit sad when some source of this signal gets invalidated.
Just don’t go too far with it? Like, if someone wrote a book while holding a pen with their toes while doing a headstand, that’s not a good signal that the book will be of any interest to you.
I think you also have to factor in selection bias. Like, suppose there are 3 organizations with 100 resource units each, 10 with 20 units, and 30 with 5 units. And maybe resources are helpful, but not helpful enough that all the advancements will be concentrated in the top 3.
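As a rough sketch of those numbers (assuming, purely for illustration, that the chance an org produces the next advance is proportional to its resource units):

```python
# Toy calculation for the tiers above, under the assumed proportional model.
tiers = [(3, 100), (10, 20), (30, 5)]  # (number of orgs, resource units per org)

total_units = sum(n * units for n, units in tiers)  # 3*100 + 10*20 + 30*5 = 650

for n, units in tiers:
    tier_share = n * units / total_units  # expected share of advances from the tier
    per_org = units / total_units         # chance that one specific org gets it
    print(f"{n} orgs x {units} units: tier share {tier_share:.0%}, per org {per_org:.1%}")

# Under this assumption the top 3 orgs get ~46% of advances: each one is far more
# likely than any individual small org, yet more than half of all advances still
# come from outside the top 3, simply because there are so many more small orgs.
```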
I would really love it if some “let’s make ASI” people put some effort into making bad outcomes less bad. Like, it would really suck if we get trapped in an endless corporate punk hell, with superintelligent nannies with correct (tm) opinions. Or infinite wedding parties or whatever. Just make sure that if you fuck up, we all just get eaten by nanobots, please. Permanent entrapment in misery would be a lot worse.
I don’t know if it’s applicable? Like, I’m asking for The Favorite color, not “suggest a random cool color please”. I should probably test that too.
https://ygo-assets-websites-editorial-emea.yougov.net/documents/tabs_OPI_color_20141027_2.pdf
According to these two (dubious) surveys I just found, 30% of humans pick Blue, 15% Purple, and 10% Green. It’s not particularly far from the human distribution (if you are guessing that they should just report the favorite colors humans report). A bit more skewed toward blue, and further from red.
Favorite colors of some LLMs.
Do you want joy, or to know what things are out there? Like, it’s a fundamental question about justifications: do you use joy to keep yourself going while you gain understanding, or do you gain understanding to get some high-quality joy?
That sounds like two different kinds of creatures in the transhumanist limit of it: some trade off knowledge for joy, others trade off joy for knowledge.
Or whatever, not necessarily “understanding”; you can bind yourself to other properties of your territory. Well, in terms of maps it’s a preference for good correspondence, and a preference for not spoofing that preference.
Also, just on priors, consider how unproductive and messy the conversation caused by this post and its author was, mostly talk about who said what and analysis of the virtues of the participants. I think even without reading it, that’s an indicator of a somewhat doubtful origin for a set of prescriptivist guidelines.
Shameless self-promotion: this one: https://www.lesswrong.com/posts/ASmcQYbhcyu5TuXz6/llms-could-be-as-conscious-as-human-emulations-potentially
It sidesteps the object-level question and instead looks at the epistemic one.
This one is about the broader question of how the things that have happened change people’s attitudes and opinions:
https://www.astralcodexten.com/p/sakana-strawberry-and-scary-ai
This one too, about consciousness in particular:
https://dynomight.net/consciousness/
I think it’s a somewhat productive direction that these 3 posts explore, but it’s not very object-level, it’s more about the epistemics of it all. I think you can look up how LLM states overlap with / predict / correspond to brain scans of people engaged in some tasks? I think there were a couple of papers on that.
E.g. here https://www.neuroai.science/p/brain-scores-dont-mean-what-we-think
Yeah! My point is more “let’s make it so that the possible failures on the way there are graceful”. Like, IF you made a par-human agent that wants to, I don’t know, spam the internet with the letter M, you don’t just delete it or rewrite it to be helpful, harmless, and honest instead, like it’s nothing. So we can look back at this time and say “yeah, we made a lot of mad science creatures on the way there, but at least we treated them nicely”.
There’s a bit of a disconnect between the statement “it all adds up to normality” and how it could actually look. Suppose someone discovered a new physical law: if you say “fjrjfjjddjjsje” out loud, you can fly, somehow. Now you are reproducing all of the old model’s confirmed behaviour while flying. Is this normal?