I’m an artist, writer, and human being.
To be a little more precise: I make video games, edit Wikipedia, and write here on LessWrong!
I’m really intrigued by this idea! It seems very similar to past thoughts I’ve had about “blackmailing” the AI, but with a more positive spin.
Thanks for the fascinating response! It’s intriguing that we don’t have more or better-tested “emergency” measures on hand; do you think there’s value in specifically working on quickly implementable alignment models, or would that be a waste of time?
Let’s say it’s to be a good conversationalist (in the vein of the GPT series) or something; feel free to insert your own goal here, since this is meant as an intuition pump, and if a specific goal helps you answer better, then let’s go with that.
On reflection, I see your point, and will cross that section out for now, with the caveat that there may be variants of this idea which have significant safety value.
The goal would be to start any experiment which might plausibly lead to AGI with a metaphorical gun to the computer’s head, such that being less than (observably) perfectly honest with us, “pulling any funny business,” etc. would lead to its destruction. If you can make cooperation, rather than attempted defection with the risk of getting caught, the path of least resistance for making it safely out of the box, you should be able to productively manipulate AGIs in many (albeit not all) possible worlds. Obviously this should be done on top of other alignment methods, but I doubt it would hurt things much, and it would likely help as a significant buffer.
I wasn’t aware you were offering a bounty! I rarely check people’s profile pages unless I need to contact them privately, so it might be worth mentioning the bounty at the beginning or end of relevant posts.
Are there any good introductions to the practice of writing in this format?
This is an excellent point actually, though I’m not sure I fully agree (sometimes a lack of information could end up being far worse, especially if people think we’re further along than we really are and try to get into an “arms race” of sorts).
Interesting! I do wish you were able to talk more openly about this (I think a lot of the confusion is coming from the lack of public information about how LaMDA works), but that’s useful to know at least. Is there any truth to the claim that real-time updating is going on, or is that false as well?
This is a really high-quality comment, and I hope that at least some expert can take the time to either convincingly argue against it, or help confirm it somehow.
AIs do not have survival instincts by default
I think a “survival instinct” would be a higher-order convergent value than “kill all humans,” no?
lol well now it needs to be one!
Quick thought: has there been any research investigating the difference between empathetic and psychopathic people? I wonder if that could help us better understand alignment…
This is a really important post, I think, and I hope it gets seen by the right people! Developing a culture in which we can trust each other is really essential, and I do wish there were more focus on making progress viewable from an outside perspective.
Imagine, then, that LaMDA were a completely black-box model, and that the output was such that you would be convinced of its sentience. This is admittedly a different scenario from what actually happened, but it should be enough to serve as an intuition pump.
This reply sounds way scarier than what you originally posted lol. I don’t think a CEO would be too concerned by what you wrote (given the context), but now there’s the creepy sense of the infohazardous unknown.