Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
There is no fire alarm for AGIs? Maybe just subscribe to the DeepMind RSS feed…
On a more serious note, I’m curious about the internal review process for this article: what role did the DeepMind AI safety team play in it? The Acknowledgements make no mention of their contribution.
Thank you for writing this, Alexandre. I am very happy that this is now public, and some paragraphs in part 2 are really nice gems.
I think parts 1 and 2 are a must-read for anyone who wants to work on alignment; they articulate dynamics that I think are extremely important.
Parts 3–5, which focus more on Conjecture, are more optional in my opinion and could have been a separate post, but they are still interesting. They changed my opinion of Conjecture, and I now see much more coherence in their agenda. My previous understanding of Conjecture’s plan was mostly focused on their technical agenda, CoEm, as presented in a section here. However, I was missing the big picture. This is much better.
I agree that I haven’t argued the positive case for more governance/coordination work (which is why I hope to write a follow-up post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-risks from misuse and grossly negligent accidents.
> This post would have been far more productive if it had focused on exploring them.
So the sections “Counteracting deception with only interp is not the only approach”, “Preventive measures against deception”, “Cognitive Emulations”, and “Technical Agendas with better ToI” don’t feel productive? It seems to me that this is already a good list of neglected research agendas, so I don’t understand.
> if you hadn’t listed it as “Perhaps the main problem I have with interp”
In the above comment, I only agree with “we shouldn’t do useful work, because then it will encourage other people to do bad things”; I don’t agree with your critique of “Perhaps the main problem I have with interp...”, which I think is not justified enough.
I think this is very important, and this goes directly into the textbook from the future.
It seems to me that if we are careful enough, and if we implement a lot of the things outlined in this agenda, we probably won’t die. But we need to actually do it.
That is why safety culture is now the bottleneck: if we are careful, we are an order of magnitude safer than if we are not. And I believe this is the most important variable.
To give props to your last paragraphs: you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, while AI governance is comparatively neglected, and I’m not sure that’s the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target on which to train my writing.
I hope to work on a follow-up post detailing strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope such a post would be the ideal place for more constructive conversations, although I doubt that I am the best-suited person to write it.
There is also davidad’s Open Agency Architecture.
I have tried Camille’s in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.
Thank you! Yes, for most of these issues, it’s possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.
However, even though I am quite honored, writing this post doesn’t make me the best person to do this kind of work. So if anyone is inspired to work on this, feel free to send me a private message.
I read this book two years ago when it was published in French. I found it incredibly exciting to read, and it is what motivated me to discover this site and then move on to a master’s degree in machine learning.
This book saved me a lot of time in discovering Bayesianism, and it changed my way of thinking far more deeply than simply reading a textbook on Bayesian machine learning would have.
I am of course happy to have read the Sequences, but I think I am lucky to have started with The Equation of Knowledge, which is much shorter to read and provides the theoretical assurances, motivation, main tools, enthusiasm, and pedagogy needed to engage in the quest for Bayesianism.
If Facebook AI Research is such a threat, wouldn’t it be possible to talk to Yann LeCun?
Ah, “The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example.”
Interesting. Changing the in-distribution (by 3 OOM) does not much influence the out-of-distribution (×2).
Could you explain a bit what “will only be likely to contain arbitrary patterns of sizes up to log(10^120)” means, please? Or give some pointers to other uses of this kind of calculation?
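My naive reading, in case it helps (this is my own back-of-the-envelope guess, not something from the post): a random string of length N is only likely to contain every fixed pattern of up to about log(N) bits, since a k-bit pattern has an expected number of occurrences of roughly N / 2^k, and here log₂(10^120) = 120 · log₂(10) ≈ 399 bits. Is that the intended calculation?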
It’s not strong evidence; it’s a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me like a roughly 50/50 bet. Saying the probability is 1% requires much more work than I’m seeing here, even though I appreciate what you have put together.
On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We’ve already seen one roughly equal human civilization take over another under certain circumstances (see the story of the conquistadors). And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, etc.
On fast vs. slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a double-digit probability of a takeoff lasting less than a year. If so, we won’t have time to implement defense strategies.
I really enjoyed this article and I’m now convinced that a better understanding of steganography could be a crucial part of OpenAI’s plan.
> Communicating only 1 bit may be enough for AIs to coordinate on a sudden takeover: this is not a regime where restricting AIs to a small capacity is sufficient for safety
I’ve never thought about this. Scary.
Why won’t this alignment idea work?
Researchers have already succeeded in creating face-detection systems from scratch, coding the features one by one, by hand. The algorithm they coded was not perfect, but it was good enough to be used industrially in the digital cameras of the last decade.
The brain’s face-recognition algorithm is not perfect either. It tends to produce false positives, which explains a good share of paranormal sightings. The brain’s other hard-coded networks seem to rely on the same kind of heuristics: hard-coded by evolution, and imperfect.
However, it turns out that humans, despite these imperfect evolutionary heuristics, are generally cooperative and friendly.
This suggests that the seed of alignment can be roughly coded and yet work.
1. Can’t we replicate this kind of research effort, hand-crafting human detectors and hand-crafting “friendly” behaviour?
2. Nowadays, this quest would be made easier by deep learning: no need to hand-craft a baby detector, just train a neural network that recognizes babies and, above a certain threshold, triggers a reaction that releases tenderness hormones. There is no need to code the detector, just to train it; only the reaction corresponding to the tenderness hormone must be hand-coded (see the sketch after this list).
3. This process will leave gaping holes, which will have to be patched one by one. But this is plausibly what happened during evolution.
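To make point 2 concrete, here is a minimal sketch of what I have in mind, assuming a toy PyTorch setup; every name here (`BabyDetector`, `tenderness_response`, the 0.8 threshold) is illustrative, not an existing system:

```python
import torch
import torch.nn as nn

class BabyDetector(nn.Module):
    """Stand-in for a learned detector: image -> probability a baby is present."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.backbone(image))

def tenderness_response(strength: float) -> None:
    """The hand-coded part, analogous to releasing tenderness hormones."""
    print(f"care-mode activated (strength={strength:.2f})")

detector = BabyDetector()         # in practice: trained on labeled images
THRESHOLD = 0.8                   # hand-chosen trigger level

image = torch.rand(1, 3, 64, 64)  # dummy input standing in for a camera frame
p = detector(image).item()
if p > THRESHOLD:
    tenderness_response(p)
```

The division of labor is the point: the detector is learned, while the response and the threshold stay small enough to be written and audited by hand.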
The problems are:
- We are not allowed to iterate with a strong AI
- We are not sure that this would extrapolate well to higher levels of capability
Ok
But if we were to work on this today, it would only be at a sub-human level, and we could iterate as with a child. And even if we had the complete code of the brain stem, and we had “reverse-engineered human social instincts” as Steven Byrnes proposes here, it seems to me that we would still have to do all of this.
What do you think?
[We don’t think this long term vision is a core part of constructability, this is why we didn’t put it in the main post]
We are unsure, but here are several possibilities.
Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:
- Using GPT-6 to implement GPT-7-white-box (foom?)
- Using GPT-6 to implement GPT-6-white-box
- Using GPT-6 to implement GPT-4-white-box
- Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that cannot learn
- Using GPT-6 to implement AlexNet-white-box
- Using GPT-6 to implement a transparent expert system that filters CVs without using protected features
**Comprehensive AI Services path**
We aim to reach the level of Alexa++, which would already be very useful: no more breaking your back picking up potatoes. Compared to the robot Figure01, which could kill you if your neighbor jailbreaks it, our robot seems safer: it would not have the capacity to kill, only to put the plates in the dishwasher, in the same way that today’s Alexa cannot insult you.
Fully autonomous AGI, even if transparent, is too dangerous. We think that aiming for something like Comprehensive AI Services would be safer. Our plan would be part of this, allowing for the creation of many small capable AIs that may compose together (for instance, in the case of a humanoid housekeeper, having one function to do the dishes, one function to walk the dog, …).
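As a toy illustration of that composition (all names here are ours for the example, not an actual design), each skill is a separate bounded function and the dispatcher is plain, auditable code with a fixed whitelist, so the composite system cannot acquire capabilities it was never given:

```python
from typing import Callable, Dict

def do_dishes(task: str) -> str:
    return f"dishes done ({task})"

def walk_dog(task: str) -> str:
    return f"dog walked ({task})"

# Fixed capability whitelist: anything not listed simply cannot be invoked.
SKILLS: Dict[str, Callable[[str], str]] = {
    "dishes": do_dishes,
    "walk_dog": walk_dog,
}

def housekeeper(request: str, skill: str) -> str:
    """Plain-code dispatcher: no learning at runtime, no open-ended actions."""
    if skill not in SKILLS:
        return "refused: capability not in whitelist"
    return SKILLS[skill](request)

print(housekeeper("dinner plates", "dishes"))
print(housekeeper("front door", "unlock_door"))  # refused: not a granted skill
```

In a real system each skill would itself be a white-box program produced by constructability, but the safety argument stays the same: capabilities are enumerated, not emergent.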
Alexa++ is not an AGI, but it is already fine. It even knows how to do a backflip, Boston Dynamics style. Not enough for a pivotal act, but very stylish. We can probably have a nice world without AGI in the wild.
**The Liberation path**
Another possible moonshot theory of impact would be to replace GPT-7 with GPT-7-plain-code. Maybe there’s a “liberation speed” n at which we can use GPT-n to directly code GPT-p with p > n. That would be super cool, because it would free us from deep learning.
**Guided meditation path**
You are not really enlightened if you are not able to code yourself.
Maybe we don’t need to use something as powerful as GPT-7 to begin this journey.
We think that with significant human guidance, and by iterating many, many times, we could meander toward a progressive deconstruction of GPT-5.
Going from GPT-5 to GPT-2-hybrid seems possible to us.
Improving GPT-2-hybrid to GPT-3-hybrid may be possible with the help of GPT-5?
...
If successful, this path could unlock the development of future AIs using constructability instead of deep learning. If constructability done right is more data-efficient than deep learning, it could simply replace deep learning and become the dominant paradigm. This would leave humans in a much better endgame position from which to control and develop future advanced AIs.
| Path | Feasibility | Safety |
|---|---|---|
| Comprehensive AI Services | Very feasible | Very safe, but unstable in the very long run |
| Liberation | Feasible | Unsafe, but could enable a pivotal act that makes things stable in the long run |
| Guided Meditation | Very hard | Fairly safe, and could unlock a safer technology than deep learning, leaving humanity in a better end-game position |
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing the main difficulties.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Maybe. I would still argue that other research avenues are neglected in the community.
I provided plenty of technical research directions in the “preventive measures” section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop interp research altogether, just that we should consider other avenues.
I think I agree, but this is only one of the many points in my post.