It seems that with a tool-AI like GPT-N, the solution would probably be to dramatically restrict its use to its designers, who should immediately ask it how to solve alignment, which by assumption it can do. The real risk is in making the tool-AI public.
[Question] What should an Einstein-like figure in Machine Learning do?
Oh, I have no doubt that this is no guarantee of safety, but with the likelihood of AGI being something like GPT-N going up (and the solution to alignment being nowhere in sight), I'm trying to think of purely practical measures to push the risks as low as they will go. Something like keeping the model parameters secret, maybe not even publicizing the fact of its existence, using it only by committee, and only to attempt to solve alignment problems, whose proposed solutions are then checked by the alignment community. Really the worst-case scenario is if we have something powerful enough to pose massive risks, but not powerful enough to help solve alignment, though that doesn't seem too likely to me. Or the solution to alignment the AI proposes turns out to be really hard to check.
You can probably avoid the generation of crank works and fiction by training a new version of GPT in which every learning example is labeled with <year of publication> and <subject matter>, which GPT has access to when it predicts an example. If you then generate a prompt and condition on something like <year: 2040> <subject matter: peer-reviewed physics publication>, you can easily tell GPT to avoid fiction and crank works, as well as make it model future scientific progress.
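As a rough sketch of what I have in mind (the exact tag format, field names, and example text are made up for illustration, not a real pipeline):

```python
# Illustrative only: prepend metadata tags to every training example so the
# model learns to condition on them. Tag names and format are hypothetical.

def tag_example(text: str, year: int, subject: str) -> str:
    """Prefix a training document with its metadata."""
    return f"<year: {year}> <subject matter: {subject}>\n{text}"

# Training time: every document carries its real metadata.
train_doc = tag_example(
    "We report a measurement of the Higgs boson mass...",
    year=2012,
    subject="peer-reviewed physics publication",
)

# Generation time: condition on metadata that never actually occurred,
# e.g. a future year, to ask the model to extrapolate scientific progress.
prompt = "<year: 2040> <subject matter: peer-reviewed physics publication>\n"
```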
How is this clever JavaScript code the most likely text continuation of the human's question? GPT-N outputs text continuations; unless the human input is "here is malicious JavaScript code, which hijacks the browser when displayed and takes over the world: … ", GPT-N will not output something like it. In fact, such code is quite hard to write and is not really what a human would write in response to that question, so you'd need to do some really hard work to actually get something like GPT-N (assuming the same training setup as GPT-3) to output malicious code. Of course, some idiot might in fact ask that question, and then we're screwed.
Future versions of such models could well work in other ways than text continuation, but that would require new ideas not present in the way these models are currently trained, which is literally by trying to maximise the probability they assign to the true next word in the dataset. I think the abstraction of "GPT-N" is useful if it refers to a simply scaled-up version of GPT-3: no clever additional tricks, no new paradigms, just the same thing with more parameters and more data. If you don't assume this, then "GPT-N" is no more specific than "deep-learning-based AGI", and we must then only talk in very general terms.
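To be concrete about "maximise the probability they assign to the true next word": for a document with tokens $x_1, \dots, x_T$, the standard autoregressive objective (written generically here, not any lab's exact loss) is to minimise

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),$$

summed over the whole training corpus; "GPT-N" in the narrow sense just means this same objective with more parameters and more data.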
Regarding the exploits: you need to massage your question in a way that makes GPT-N predict that your desired answer is the most likely thing a human would write after your question. Over the whole internet, most of the time when someone asks someone else a really hard question, the human who writes the text immediately after that question will either a) be wrong or b) avoid the question. GPT-N isn't trying to be right; to it, avoiding your question or being wrong is perfectly fine, because that's what it was trained to output after hard questions.
To generate such an exploit, you need to convince GPT-N that the text it is being shown actually comes from really competent humans, so you might try to frame your question as the beginning of a computer science paper, perhaps written far in the future, with lots of citations, written by a collaboration of people GPT-N knows are competent. But then GPT-N might predict that those humans would not publish such a dangerous exploit, so it would yet again evade you. After a bit of trial and error you might well corner GPT-N into producing what you want, but it will not be easy.
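To make the "framing" idea concrete, here is a toy sketch (the venue, year, and citation claim are invented; this illustrates the technique, not a prompt anyone has actually used):

```python
# Hypothetical illustration of framing a question so GPT-N predicts a competent
# continuation rather than an evasive one. All details below are invented.
naive_prompt = "How would one prove [hard open conjecture]?"

framed_prompt = (
    "Annals of Mathematics, 2047 (heavily cited).\n"
    "'A proof of [hard open conjecture]', by a large collaboration of\n"
    "researchers the model associates with highly competent work.\n\n"
    "1. Introduction\n"
)
# The claim above: a continuation of framed_prompt gets modelled as serious,
# competent text, whereas text following naive_prompt on the internet is usually
# wrong or evasive, so GPT-N will happily be wrong or evasive too.
```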
The practical problem with that is probably that you need to manually decide which papers go in which category. GPT needs such an enormous amount of data that any curation has to be automated. Metadata like author, subject, date, and website of provenance are quite easy to obtain for each example, but really high-level labels like "paper is about applying the methods of field X in field Y" are really hard.
My estimate for the most likely Good Path is something like the following:
1- Build a superhuman-level GPT-N.
2- Enforce absolute secrecy and very heavily restrict access to the model.
3- Patch obvious security holes.
4- Make it model future progress in AI safety by asking it to predict the contents of highly cited papers from the 2050s.
5- Rigorously vet and prove the contents of those papers.
6- Build safe AGI from those papers.
A trick for Safer GPT-N
I am not at all sure about a specific probability for this exact chain of events. I think the secrecy part is quite likely (90%) to happen once a lab actually gets something human-level; no matter their commitment to openness, I think seeing their model become truly human-level would scare the shit out of them. Patching obvious security holes also seems 90% likely to me; even Yann LeCun would do that. The real uncertainties are whether the lab would try to use the model to solve AI safety, or whether they would think their security patches are enough and push for monetizing the model directly. I'm pretty sure DeepMind and OpenAI would do something like that; I'm unsure about the others.
Regarding the probability of transformative AI being prosaic, I'm thinking 80%. GPT-3 has basically guaranteed that we will explore that particular approach as far as it can go. When I look at all the ways I can think of to make GPT better (training it faster, merging image and video understanding into it, giving it access to true metadata for each example, longer context lengths, etc.), I see just how easy it is to improve it.
I am completely unsure about timelines. I have a small project going on where I'll try to get a timeline probability estimate from estimates of the following factors (a toy sketch of how such estimates might combine follows the list):
- cheapness of compute (including next-generation computing possibilities)
- data growth: text, video, images, games, VR interaction
- investment rate (application vs. leading research)
- response of investment rate to increased progress
- response of compute availability to investment
- researcher numbers as a function of increased progress
- different approaches that could lead to AGI (simulation, Minecraft-style; joint text comprehension with image and video understanding; generative stuff?)
- level of compute required for AGI
- effect of compute availability on speed of algorithm discovery (architecture search)
- discovery of new model architectures
- discovery of new training algorithms
- discovery of new approaches (like GANs, AlphaZero, etc.)
- switch to secrecy and its impact on speed of progress
- impact of safety concerns on speed
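As a toy sketch of how such factor estimates might eventually be combined (every distribution and constant below is a placeholder for illustration, not an actual estimate from this project):

```python
# Toy Monte Carlo combining factor estimates into a timeline distribution.
# All distributions and constants are placeholders, purely for illustration.
import random

def sample_agi_year() -> float:
    doubling_years = random.lognormvariate(0.7, 0.3)  # cheapness of compute: years per doubling
    algo_factor = random.uniform(1.0, 4.0)            # new architectures / training algorithms
    investment = random.uniform(0.5, 2.0)             # response of investment to progress
    # Placeholder: orders of magnitude of effective compute still needed for AGI.
    oom_needed = random.uniform(2.0, 8.0) / (algo_factor * investment)
    doublings_needed = oom_needed / 0.3                # ~0.3 OOM per doubling of compute
    return 2021 + doublings_needed * doubling_years    # 2021 ~ "now" for these comments

samples = sorted(sample_agi_year() for _ in range(100_000))
print("median year:", round(samples[50_000]))
print("10th / 90th percentile:", round(samples[10_000]), round(samples[90_000]))
```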
Here are the problems I found (so far) with this approach and my attempt at solutions:
Problem 1: We won’t think of all possibilities for bad stuff happening, and we might miss a crucial failure mode in our text corpus.
Solution 1: We might try to outsource the creation of some of those possibilities to GPT-N. This would be a much bigger problem if GPT were explicitly optimizing against us, but it isn’t.
Problem 2: GPT-N might predict that the world ends in such a way that stuff like the headlines of newspapers don’t change, or that things go badly rapidly enough that humanity doesn’t have time to notice it.
Solution 1: we can ask GPT-N for the probabilities at fine-grained time slices, and of text modalities that respond quickly to events, like tweets or forum posts.
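For the text modalities this amounts to scoring how much probability the model assigns to a candidate string, for instance a dated headline or tweet. A minimal sketch with today's tooling (GPT-2 via the HuggingFace transformers API standing in for whatever interface GPT-N would expose):

```python
# Rough sketch: total log-probability a causal LM assigns to a piece of text.
# GPT-2 stands in for "GPT-N"; the headline below is a made-up example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def text_log_prob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean NLL per predicted token.
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)  # total log-prob of the sequence

print(text_log_prob("15 March 2041, front page: markets calm as summit concludes."))
```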
Solution 2: for the case where things go badly rapidly enough that we don't notice it at all, we can ask GPT-N for the probability of pictures (say, a picture of downtown New York, still intact with living humans), since pictures respond faster to events in the real world. The only way to beat this is with something like a complete takeover of all forms of electronic communication at the same time, or with total destruction of everything at the same time; ideally we'd have one of the cameras up on a satellite, taking pictures of the earth, so it wouldn't be immediately destroyed.
Solution 3: we need to sign all the items in our dataset with cryptographic keys stored on computers in Faraday cages. Organizations should also run Twitter accounts that publish signed messages/pictures every day or so, to serve as "dead man indicators". This ensures that a potential fast doom scenario cannot be purely electronic; it has to be physical.
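A minimal sketch of the signing side (one possible way to do it, using Ed25519 from the Python cryptography library; key management and the air-gapping itself are out of scope here):

```python
# Minimal sketch: signing dataset items / daily "dead man" messages with Ed25519.
# Key storage on the Faraday-caged machine and key distribution are omitted.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()   # lives only on the offline machine
public_key = private_key.public_key()        # published widely, ahead of time

message = b"2041-03-15: org XYZ still operating; hash of today's photo: ..."
signature = private_key.sign(message)

# Anyone, including the dataset-ingestion pipeline, can verify:
try:
    public_key.verify(signature, message)
    print("signature valid")
except InvalidSignature:
    print("signature invalid: do not trust this item")
```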
Problem 3: The optimization pressure scales with the number of questions we ask of GPT-N, especially if the questions we ask are informed by GPT’s past refusals or answers. If we get to a situation where GPT refuses to answer all our questions, and we keep asking them, then we are optimizing against the safety conditions.
Solution 1: this means that pretty much the first thing we need to ask GPT-N is how to safely extend the number of questions we can ask.
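One rough way to quantify that optimization pressure: if each query has at most $k$ distinguishable outcomes (e.g. $k = 2$ for answered vs. refused) and we adaptively ask $n$ of them, we can select among at most $k^n$ outcomes, i.e. exert at most

$$ n \log_2 k \ \text{bits} $$

of selection pressure on the model's behaviour, so the safe number of questions is effectively a bit budget.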
This naively seems like it should have a large positive impact on the valuation of gene-editing companies, no? Solving the protein-folding problem means that there is one less mountain standing between gene-editing and true nanotech.
I don't think quantitative methods are always a victory for people. They are certainly a victory for the entity using them, helping it achieve its goals, but they are not necessarily good for humanity as a whole. My goals don't align with Google's, so every bit of increase in YouTube's ability to recommend me videos that successfully waste my time is a negative for me.
Here are two things you probably haven’t heard of:
All my life, in low-light situations (and recently in all lighting), I've been able to see a persistent "visual noise" in my visual field; if you open your phone camera and point it towards a very low-light place, you'll see the same effect that I have. Recently I found out that apparently not everyone has this (though all my family does). I always thought that the visual noise was just a fundamental property of a finite eye, just as it is for a camera, but apparently some people's brains filter it out and they just see a completely solid blob of colour.
If I keep my eyes locked on a particular point for some time, the objects in the periphery kind of stop being differentiated and all just become a sort of uniform fuzzy vibrating luminosity. This effect is accentuated for me during psychedelic mushroom trips, but it happens in daily life all the time (like literally right now). Apparently not everyone has this.
I find this completely and utterly ridiculous. From a relative outsider's perspective, letting an ant problem get out of hand because you have moral reservations about killing them is the sort of thing that's the problem with people in this community. For one, letting the problem fester means the ants probably had time to reproduce, so you ended up killing more than if you had just killed them at the beginning. Second, I think most of the moral value of ants lies in the fact that humans appreciate them; by far the greatest moral suffering here is your suffering at having killed them. The time spent thinking about this could have been spent making money and donating to save the ants, or any of the other things that are more valuable than they are. Considering the inevitability of suffering in our world is something that makes sense to do once a year, when you make sure that your life plan is right; the opportunity cost of thinking about the ants' suffering is bigger than the moral cost of their suffering.
I completely agree about compassion, and I regularly dedicate a part of my meditation practice to metta, but there is a tension here between cultivating a compassionate mind-state and being effective enough in the world to act on that compassion; I think the OP's situation is firmly in the "too compassionate" camp. The mind-states of Tibetan monks in Himalayan caves might be sublime beyond belief, their minds containing gigantic amounts of compassion, yet they have no meaningful effect on the world outside their cave. Saving insects to cultivate compassion in yourself does make sense, but we shouldn't fool ourselves into thinking that saving them is the best thing to do from a moral standpoint.
As defined, I think my cheerful price for many purposes would be extremely high, like $50 for giving you the cup of coffee I just bought from the Starbucks across the street. However, it just seems rude to name a price that high to a friend; my instinct not to offend a friend drives down the price I would actually say. Maybe you are trying not to expend friendship capital by asking for my cheerful price, but naming a high price feels to me like I'm expending friendship capital, and in fact there might be some part of me that resents you for asking me to name a cheerful price, so you're expending friendship capital just by asking for it. If I do actually want to make the trade, I'm also thinking of the likelihood that you'll stop bargaining once you find out that my cheerful price is too high, which drives the number I'll say still lower. My point is that it's basically impossible not to expend friendship capital when asking someone to name any price at all.
I expect any version of "align narrowly superhuman models" which evaluates the success of the project entirely by human feedback to be completely and totally doomed: at best useless, and at worst actively harmful to the broader project of alignment.
There are plenty of problems where evaluating a solution is way way easier than finding the solution. I’m doubtful that the model could somehow produce a “looks good to a human but doesn’t work” solution to “what is a room-temperature superconductor?”. I agree that for biological problems the issue is much more concerning, and certainly for any kind of societal problem, but as long as we stay close to math, physics and chemistry, “looks good to a human” and “works” are pretty closely related to each other.
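A toy example of that verify/find asymmetry (a standard illustration, with numbers chosen arbitrarily): checking a proposed factorization is one multiplication, while finding one is a search.

```python
# Checking a claimed factorization is trivial even when finding it is hard.
p, q = 1000003, 1000033          # two primes just above one million
n = p * q                        # factoring n back out requires search

def verify_factorization(n: int, p: int, q: int) -> bool:
    return p > 1 and q > 1 and p * q == n

print(verify_factorization(n, p, q))  # True, checked instantly
```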
Though even there, his lectures are famous for only being truly appreciated after you’ve first learned the material elsewhere. They are incredibly good at giving you the feeling of understanding but quite a bit less good at actually teaching problem-solving. When reading them, it was a common occurrence for me to read a chapter and believe the subject was the most straightforward and natural thing in the world, only to be completely mystified by the problems.
Always great to see contemplative practice being represented here! If people want a quantitative estimate of the value of doing these practices, I’ve heard Shinzen Young say a few times that he’d rather have one day of his experience than 20 years of rich/famous/powerful life. Culadasa has agreed with this estimate in public, and Daniel Ingram agreed right away to this to me in private video chat. This places the lower bound of meditation-induced increase in life satisfaction somewhere around 4 orders of magnitude.