Sometimes an AI safety proposal has an issue, and some people interpret that issue as a “fatal flaw”: a sign that the proposal is unworkable for that objective. Other people interpret the same issue as a subproblem or difficulty that needs to be resolved (and potentially can be resolved). I think this is an interesting dynamic worth paying attention to, and it seems worthwhile to try to develop heuristics for better predicting which types of issues are fatal.[1]
Here’s a list of examples:
- IDA had various issues pointed out with it as a worst-case (or scalable) alignment solution, but for a while Paul thought these issues could be solved to yield a highly scalable solution; Eliezer did not. In retrospect, it isn’t plausible that IDA with some improvements yields a scalable alignment solution (especially while remaining competitive).
- You can point out various issues with debate, especially in the worst case (e.g., exploration hacking and fundamental limits on human understanding). Some people interpret these as problems to be solved (e.g., we’ll solve exploration hacking), while others think they imply that debate is fundamentally not robust to sufficiently superhuman models, and thus that it’s a technique to be used at an intermediate level of capability.
- In the context of untrusted monitoring and training an untrusted monitor with PVG, you can view exploration hacking and collusion as problems that someone will need to solve, or as fundamental reasons why we’ll really want to be using a more diverse set of countermeasures.[2]
- ARC theory’s agenda has a number of identified issues standing in the way of a worst-case safety solution. ARC theory thinks they’ll be able to patch these issues, while David (and I) are skeptical.
I don’t have any overall commentary on this; I just think it is worth noticing.
[1] I discuss this dynamic in the context of AI safety, but presumably it’s relevant for other fields.
[2] While the other bullets discuss worst-case ambitions, this one isn’t about things being fatal in the worst case; rather, it’s about the intervention potentially not being that helpful without some additional measures.
Why do you say this?
Another aspect of this is that we don’t have a good body of obstructions: ones that are clear, that are major in that they defeat many proposals, that people understand and apply, and that are widely enough agreed on.
https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=CzibNhsFf3RyxHpvT
Could you say more about the unviability of IDA?
I suspect there are a lot of people who dismissively consider an idea unworkable after trying to solve a flaw for less than 1 hour, and there are a lot of people who stubbornly insist an idea can still be saved after failing to solve a flaw for more than 1 year.
Maybe it’s too easy to reject other people’s ideas, and too hard to doubt your own ideas.
Maybe it’s too easy to continue working on an idea you’re already working on, and too hard to start working on an idea you just heard (due to the sunk cost fallacy and the friction of starting something new).