Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
There is no fire alarm for AGIs? Maybe just subscribe to the DeepMind RSS feed…
On a more serious note, I’m curious about the internal review process for this article: what role did the DeepMind AI safety team play in it? The Acknowledgements make no mention of their contribution.
Thank you for writing this, Alexandre. I am very happy that this is now public, and some paragraphs in part 2 are really nice gems.
I think parts 1 and 2 are a must-read for anyone who wants to work on alignment; they articulate dynamics that I think are extremely important.
Parts 3–5, which focus more on Conjecture, are more optional in my opinion and could have been a separate post, but they are still interesting. They changed my opinion of Conjecture, and I now see much more coherence in their agenda. My previous understanding of Conjecture’s plan was mostly focused on their technical agenda, CoEm, as presented in a section here. However, I was missing the big picture. This is much better.
I agree that I haven’t argued the positive case for more governance/coordination work (which is why I hope to write a follow-up post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-risks from misuse and grossly negligent accidents.
> This post would have been far more productive if it had focused on exploring them.
So the sections “Counteracting deception with only interp is not the only approach”, “Preventive measures against deception”, “Cognitive Emulations”, and “Technical Agendas with better ToI” don’t feel productive? It seems to me that this is already a good list of neglected research agendas, so I don’t understand.
> if you hadn’t listed it as “Perhaps the main problem I have with interp”
In the above comment, I only agree with “we shouldn’t do useful work, because then it will encourage other people to do bad things”; I don’t agree with your critique of “Perhaps the main problem I have with interp...”, which I think is not justified enough.
I think this is very important, and this goes directly into the textbook from the future.
It seems to me that if we are careful enough, and if we implement a lot of the things outlined in this agenda, we probably won’t die. But we need to actually do it.
That is why safety culture is now the bottleneck: if we are careful, we are an order of magnitude safer than if we are not. And I believe this is the most important variable.
To give props to your last paragraphs: you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, while AI governance is comparatively neglected, and I’m not sure that’s the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target on which to train my writing.
I hope to work on a follow-up post detailing strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope such a post would be the ideal place for more constructive conversations, although I doubt that I am the best-suited person to write it.
There is also davidad’s Open Agency Architecture.
I have tried Camille’s in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.
Thank you! Yes, for most of these issues, it’s possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.
However, even though I am quite honored, writing this post doesn’t make me the best person to do this kind of work. So if anyone is inspired to work on this, feel free to send me a private message.
I read this book two years ago when it was published in French. I found it incredibly exciting to read, and it is what motivated me to discover this site and then move on to a master’s degree in machine learning.
This book saved me a lot of time in discovering Bayesianism, and it changed my way of thinking far more deeply than simply reading a textbook on Bayesian machine learning would have.
I am of course happy to have read the Sequences, but I think I am lucky to have started with The Equation of Knowledge, which is much shorter to read and provides the theoretical assurances, motivation, main tools, enthusiasm, and pedagogy needed to engage in the quest for Bayesianism.
If Facebook AI Research is such a threat, wouldn’t it be possible to talk to Yann LeCun?
Ah, “The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example.”
Interesting. Changing the in-distribution (by 3 OOM) does not much influence the out-of-distribution (×2).
Could you explain a bit what “will only be likely to contain arbitrary patterns of sizes up to log(10^120)” means, please? Or give some pointers to other uses of this kind of calculation?
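My naive reading, in case it helps (this is my own back-of-the-envelope guess, not something from the post): a random string of length N is only likely to contain every fixed pattern of up to about log(N) bits, since a k-bit pattern has an expected number of occurrences of roughly N / 2^k, and here log₂(10^120) = 120 · log₂(10) ≈ 399 bits. Is that the intended calculation?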
It’s not strong evidence; it’s a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me like a roughly 50/50 bet. Saying the probability is 1% requires much more work than I’m seeing here, even though I appreciate what you have put together.
On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We’ve already seen one roughly equal human civilization take over another under certain circumstances (see the story of the conquistadors). And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, etc.
On fast vs. slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a double-digit probability of a takeoff lasting less than a year. If so, we won’t have time to implement defense strategies.
I really enjoyed this article and I’m now convinced that a better understanding of steganography could be a crucial part of OpenAI’s plan.
> Communicating only 1 bit may be enough for AIs to coordinate on a sudden takeover: this is not a regime where restricting AIs to a small capacity is sufficient for safety
I’ve never thought about this. Scary.
Why won’t this alignment idea work?
Researchers have already succeeded in creating face-detection systems from scratch, coding the features one by one, by hand. The algorithm they coded was not perfect, but it was good enough to be used industrially in the digital cameras of the last decade.
The brain’s face-recognition algorithm is not perfect either. It tends to produce false positives, which explains a good share of paranormal sightings. The brain’s other hard-coded networks seem to rely on the same kind of heuristics: hard-coded by evolution, and imperfect.
However, it turns out that humans, despite these imperfect evolutionary heuristics, are generally cooperative and friendly.
This suggests that the seed of alignment can be roughly coded and yet work.
1. Can’t we replicate this kind of research effort, hand-crafting human detectors and hand-crafting “friendly” behaviour?
2. Nowadays, this quest would be made easier by deep learning: no need to hand-craft a baby detector, just train a neural network that recognizes babies and, above a certain threshold, triggers a reaction that releases tenderness hormones. There is no need to code the detector, just to train it; only the reaction corresponding to the tenderness hormone must be hand-coded (see the sketch after this list).
3. This process will leave gaping holes, which will have to be patched one by one. But this is plausibly what happened during evolution.
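To make point 2 concrete, here is a minimal sketch of what I have in mind, assuming a toy PyTorch setup; every name here (`BabyDetector`, `tenderness_response`, the 0.8 threshold) is illustrative, not an existing system:

```python
import torch
import torch.nn as nn

class BabyDetector(nn.Module):
    """Stand-in for a learned detector: image -> probability a baby is present."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.backbone(image))

def tenderness_response(strength: float) -> None:
    """The hand-coded part, analogous to releasing tenderness hormones."""
    print(f"care-mode activated (strength={strength:.2f})")

detector = BabyDetector()         # in practice: trained on labeled images
THRESHOLD = 0.8                   # hand-chosen trigger level

image = torch.rand(1, 3, 64, 64)  # dummy input standing in for a camera frame
p = detector(image).item()
if p > THRESHOLD:
    tenderness_response(p)
```

The division of labor is the point: the detector is learned, while the response and the threshold stay small enough to be written and audited by hand.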
The problems are:
- We are not allowed to iterate with a strong AI
- We are not sure that this would extrapolate well to higher levels of capability
Ok
But if we were to work on this today, it would only be at a sub-human level, and we could iterate as with a child. And even if we had the complete code of the brain stem, and we had “reverse-engineered human social instincts” as Steven Byrnes proposes here, it seems to me that we would still have to do all of this.
What do you think?
[We don’t think this long term vision is a core part of constructability, this is why we didn’t put it in the main post]
We are unsure, but here are several possibilities.
Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:
- Using GPT-6 to implement GPT-7-white-box (foom?)
- Using GPT-6 to implement GPT-6-white-box
- Using GPT-6 to implement GPT-4-white-box
- Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that cannot learn
- Using GPT-6 to implement AlexNet-white-box
- Using GPT-6 to implement a transparent expert system that filters CVs without using protected features
**Comprehensive AI Services path**
We aim to reach the level of Alexa++, which would already be very useful: no more breaking your back picking up potatoes. Compared to the robot Figure01, which could kill you if your neighbor jailbreaks it, our robot seems safer: it would not have the capacity to kill, only to put the plates in the dishwasher, in the same way that today’s Alexa cannot insult you.
Fully autonomous AGI, even if transparent, is too dangerous. We think that aiming for something like Comprehensive AI Services would be safer. Our plan would be part of this, allowing for the creation of many small capable AIs that may compose together (for instance, in the case of a humanoid housekeeper, having one function to do the dishes, one function to walk the dog, …).
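As a toy illustration of that composition (all names here are ours for the example, not an actual design), each skill is a separate bounded function and the dispatcher is plain, auditable code with a fixed whitelist, so the composite system cannot acquire capabilities it was never given:

```python
from typing import Callable, Dict

def do_dishes(task: str) -> str:
    return f"dishes done ({task})"

def walk_dog(task: str) -> str:
    return f"dog walked ({task})"

# Fixed capability whitelist: anything not listed simply cannot be invoked.
SKILLS: Dict[str, Callable[[str], str]] = {
    "dishes": do_dishes,
    "walk_dog": walk_dog,
}

def housekeeper(request: str, skill: str) -> str:
    """Plain-code dispatcher: no learning at runtime, no open-ended actions."""
    if skill not in SKILLS:
        return "refused: capability not in whitelist"
    return SKILLS[skill](request)

print(housekeeper("dinner plates", "dishes"))
print(housekeeper("front door", "unlock_door"))  # refused: not a granted skill
```

In a real system each skill would itself be a white-box program produced by constructability, but the safety argument stays the same: capabilities are enumerated, not emergent.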
Alexa++ is not an AGI, but it is already fine. It even knows how to do a backflip, Boston Dynamics style. Not enough for a pivotal act, but very stylish. We can probably have a nice world without AGI in the wild.
**The Liberation path**
Another possible moonshot theory of impact would be to replace GPT-7 with GPT-7-plain-code. Maybe there’s a “liberation speed” n at which we can use GPT-n to directly code GPT-p with p > n. That would be super cool, because it would free us from deep learning.
**Guided meditation path**
You are not really enlightened if you are not able to code yourself.
Maybe we don’t need to use something as powerful as GPT-7 to begin this journey.
We think that with significant human guidance, and by iterating many, many times, we could meander toward a progressive deconstruction of GPT-5.
Going from GPT-5 to GPT-2-hybrid seems possible to us.
Improving GPT-2-hybrid to GPT-3-hybrid may be possible with the help of GPT-5?
...
If successful, this path could unlock the development of future AIs using constructability instead of deep learning. If constructability done right is more data-efficient than deep learning, it could simply replace deep learning and become the dominant paradigm. This would leave humans in a much better endgame position from which to control and develop future advanced AIs.
| Path | Feasibility | Safety |
|---|---|---|
| Comprehensive AI Services | Very feasible | Very safe, but unstable in the very long run |
| Liberation | Feasible | Unsafe, but could enable a pivotal act that makes things stable in the long run |
| Guided Meditation | Very hard | Fairly safe, and could unlock a safer technology than deep learning, leaving humanity in a better end-game position |
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing the main difficulties.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Maybe. I would still argue that other research avenues are neglected in the community.
I provided plenty of technical research directions in the “preventive measures” section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop interp research altogether, just that we should consider other avenues.
I think I agree, but this is only one of the many points in my post.