Ok, time to review this post and assess the overall status of the project.
Review of the post
What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I’m still quite satisfied with the construction of the post—it’s progressive and clearly distinguishes between what’s important and what’s not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn’t really understand it, and thought, “There’s no way this is going to work.” Then I reconsidered, thought about it more deeply, and realized there was something important here. Hopefully, this post succeeded in showing that there is indeed something worth exploring! I think such distillation and analysis are really important.
I’m especially happy about the fact that we tried to elicit as much as we could from Davidad’s model during our interactions, including his roadmap and some ideas of easy projects to get early empirical feedback on this proposal.
Current Status of the Agenda
(I’m not the best person to write this; treat it as an informal personal opinion.)
Overall, Davidad performed much better than expected in his new job as programme director at ARIA and got funded $74M over 4 years. I still think this is the only plan that could enable the creation of a very powerful AI capable of performing a true pivotal act to end the acute risk period, and I think this last part is the added value of the plan, especially in the sense that it could be done in a somewhat ethical/democratic way compared to other forms of pivotal acts. However, it’s probably not going to happen in time.
Are we on track? Weirdly, yes for the non-technical aspects, but no for the technical ones. The post includes a roadmap with 4 stages, and we can check if we are on track. It seems to me that Davidad jumped directly to stage 3, without going through stages 1 and 2. This is because he was selected as programme director at ARIA, so he’s probably going to do stages 1 and 2 directly from within ARIA.
Stage 1 (Early Research Projects) is not really accomplished:
“Figure out the meta ontology theory”: Maybe the most important point of the four; currently a work in progress at ARIA, where a massive team of mathematicians has been hired to solve this.
“Heuristics used by the solver”: Nope
“Building a toy infra-Bayesian “Super Mario”, and then applying this framework to model Smart Grids”: Nope
“Training LLMs to write models in the PRISM language by backward distillation”: Kind of already here; probably not very high value to spend time on, as I think this is going to be solved by default. (For a sense of what such a model looks like, see the sketch after this roadmap.)
Stage 2: Industry actors’ first projects: I think this step is no longer meaningful because of ARIA.
Stage 3, a formal arrangement to get labs to collectively agree to increase their investment in OAA, is almost here, in the sense that Davidad got millions to execute this project at ARIA and published his multi-author manifesto, which backs the plan with legendary names, notably Yoshua Bengio as scientific director of the project.
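As a concrete aside on that last Stage 1 item, here is a minimal sketch of the kind of artifact “writing models in the PRISM language” produces: the classic Knuth-Yao fair-die model from PRISM’s own tutorial, checked from Python. This is my own illustration, not anything from Davidad’s roadmap, and it assumes the stormpy bindings for the Storm model checker, using calls from its getting-started guide (PRISM’s own CLI would work equally well).

```python
# A minimal sketch of what "writing models in the PRISM language" produces.
# The model is the classic Knuth-Yao fair die from PRISM's tutorial; the
# checking step assumes the stormpy bindings for the Storm model checker
# (pip install stormpy), with the API from its getting-started guide.
import stormpy

PRISM_DIE = """
dtmc

module die
    // Knuth-Yao: simulate a fair six-sided die using only fair coin flips
    s : [0..7] init 0; // internal state of the coin-flipping walk
    d : [0..6] init 0; // die value, set on reaching the final state s=7

    [] s=0 -> 0.5 : (s'=1) + 0.5 : (s'=2);
    [] s=1 -> 0.5 : (s'=3) + 0.5 : (s'=4);
    [] s=2 -> 0.5 : (s'=5) + 0.5 : (s'=6);
    [] s=3 -> 0.5 : (s'=1) + 0.5 : (s'=7) & (d'=1);
    [] s=4 -> 0.5 : (s'=7) & (d'=2) + 0.5 : (s'=7) & (d'=3);
    [] s=5 -> 0.5 : (s'=7) & (d'=4) + 0.5 : (s'=7) & (d'=5);
    [] s=6 -> 0.5 : (s'=2) + 0.5 : (s'=7) & (d'=6);
    [] s=7 -> (s'=7);
endmodule
"""

with open("die.prism", "w") as f:
    f.write(PRISM_DIE)

program = stormpy.parse_prism_program("die.prism")
# Probability of eventually rolling a six; the answer should be 1/6.
props = stormpy.parse_properties("P=? [ F s=7 & d=6 ]", program)
model = stormpy.build_model(program, props)
result = stormpy.model_checking(model, props[0])
print(result.at(model.initial_states[0]))  # ~0.16666...
```

Current LLMs already write this kind of guarded-command code passably well, which is why I expect that roadmap item to be solved by default.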
The lack of prototyping is concerning. I would have really liked to see an “infra-Bayesian Super Mario” or something similar, as mentioned in the post. If it’s truly simple to implement, it should have been done by now, and it would help many people understand how the proposal could work. If it’s not simple, attempting it would reveal that it’s not straightforward at all. Either way, it would be pedagogically useful for anyone approaching the project, and necessary if we want to make these values democratic. It’s very regrettable that this hasn’t been done after two years. (I think people from the AI Objectives Institute tried something at some point, but I’m not aware of anything publicly available.) This complete lack of prototypes is my number one concern preventing me from recommending more “safe by design” agendas to policymakers.
This plan was an inspiration for constructability: the bold plan might decay gracefully, for example into constructability, by renouncing formal verification and using only traditional software engineering techniques.
International coordination is an even bigger bottleneck than I thought. The “CERN for AI” isn’t really within the Overton window, but I think this applies to all the other plans, not just Davidad’s. (Davidad made a little analysis of this aspect here.)
At the end of the day: kudos to Davidad for successfully building coalitions, which is already beyond amazing; he is really an impressive thought leader. What I’m waiting to see over the next year is what we can already do with AIs such as o3, which are already impressive at competitive programming and scientific knowledge. I remain excited and eager to see the next steps of this plan.
What is an “infra-Bayesian Super Mario” supposed to mean? I studied infra-Bayes under Vanessa for half a year, and I have no idea what this could possibly mean. I asked Vanessa when this post came out, and she also said she can’t guess what you might mean by this. Can you explain what this is? The fact that the only part of the plan I know something about seems to be nonsense makes me very skeptical.
Also, can you give more information or link to a resource on what Davidad’s team is currently doing? It looks like they are the best-funded AI safety group that currently exists (unless you count Anthropic), but I never hear about them.
(I’m guessing) “Super Mario” might refer to a simulation of the Safeguarded AI / Gatekeeper stack in a video game. It looks like they’re skipping video games and going straight to cyber-physical systems (1, 2).
Update: a new funding call from ARIA calls out the Safeguarded AI / Gatekeeper stack in a video game directly:
“Creating (largely) self-contained prototypes/minimal-viable-products of a Safeguarded AI workflow, similar to this example but pushing for incrementally more advanced environments (e.g. Atari games).”
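For readers wondering what such a prototype could even look like, here is a toy sketch of the gatekeeper pattern, my own illustration rather than ARIA’s actual design: an untrusted policy proposes actions, and a separate check vetoes any action that violates an explicit safety predicate. It assumes the Gymnasium library, uses CartPole as a stand-in for the game environment, and the predicate is a hypothetical stand-in for the formally verified specification the real stack would check against.

```python
# A toy sketch of the "gatekeeper" pattern discussed above: my own
# illustration, not ARIA's actual design. An untrusted policy proposes
# actions; the gatekeeper vetoes any action that violates an explicit
# safety predicate. Assumes the Gymnasium library (pip install gymnasium);
# the predicate is a hypothetical stand-in for a formal specification.
import gymnasium as gym

def safe_to_act(obs, action) -> bool:
    """Hypothetical safety predicate: never push the cart away from
    under the pole once the pole leans past a small threshold."""
    pole_angle = obs[2]
    if pole_angle > 0.1 and action == 0:   # leaning right, pushing left
        return False
    if pole_angle < -0.1 and action == 1:  # leaning left, pushing right
        return False
    return True

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
for _ in range(200):
    proposed = env.action_space.sample()  # the untrusted "policy"
    if not safe_to_act(obs, proposed):
        proposed = 1 - proposed           # gatekeeper overrides the action
    obs, reward, terminated, truncated, info = env.step(proposed)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

The real proposal would replace the hand-written predicate with a specification checked by formal methods, but even a toy like this would make the architecture legible, which is exactly the pedagogical gap the review above complains about.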
This post was one of my first introductions to Davidad’s agenda and convinced me that while, yes, it was crazy, it was maybe not impossible, and it led me to work on initiatives like the multi-author manifesto you mentioned. Thank you for writing it!