Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
There is no fire alarm for AGI? Maybe just subscribe to the DeepMind RSS feed…
On a more serious note, I’m curious about the internal review process for this article: what role did the DeepMind AI safety team play in it? The Acknowledgements make no mention of their contribution.
Thank you for writing this, Alexandre. I am very happy that this is now public, and some paragraphs in part 2 are really nice gems.
I think parts 1 and 2 are a must-read for anyone who wants to work on alignment, and they articulate dynamics that I think are extremely important.
Parts 3-4-5, which focus more on Conjecture, are more optional in my opinion, and could have been another post, but are still interesting. This has changed my opinion of Conjecture and I see much more coherence in their agenda. My previous understanding of Conjecture’s plan was mostly focused on their technical agenda, CoEm, as presented in a section here. However, I was missing the big picture. This is much better.
I agree that I haven’t argued the positive case for more governance/coordination work (and that’s why I hope to do a next post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-Risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-Risks from misuse and grossly negligent accidents.
This post would have been far more productive if it had focused on exploring them.
So the sections “Counteracting deception with only interp is not the only approach”, “Preventive measures against deception”, “Cognitive Emulations”, and “Technical Agendas with better ToI” don’t feel productive? It seems to me that they already form a good list of neglected research agendas. So I don’t understand.
if you hadn’t listed it as “Perhaps the main problem I have with interp”
In the above comment, I only agree with “we shouldn’t do useful work, because then it will encourage other people to do bad things”, and I don’t agree with your critique of “Perhaps the main problem I have with interp...”, which I don’t think is sufficiently justified.
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing their main difficulties.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Maybe. I would still argue that other research avenues are neglected in the community.
I provided plenty of technical research directions in the “preventive measures” section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop interp research altogether, just that we should consider other avenues.
I think I agree, but this is only one of the many points in my post.