It looks like you already took out the 99.9...% claims, which are the primary thing I was reacting to. That’s great IMO. I think the new phrasing of “not claiming this is right, just getting the logic out there” is way better: both more honest and ultimately more convincing if the logic holds.
But that’s a major edit without noting the edit, so I think this should be a draft right now, not a post that’s evolving so that the comments are now addressing an earlier version. Publishing a second version that includes much of the first is a great idea.
I’d choose a different term than white box, as per Steve Byrnes’ conclusion that he just won’t use those terms since they’re confusing.
My biggest substantive comment is that you seem to be assuming that because we could get alignment right, we will get alignment right. Even Yudkowsky agrees that we could get it right.
You’re arguing that it’s a lot easier than assumed, and I think that’s probably right. But that’s not enough to be confident that we will get it right. It will depend on how seriously the first person to make self-improving AGI takes alignment, even if there are easy techniques available. Will they use them, or will they race and take risks?
But that’s a major edit without noting the edit, so I think this should be a draft right now, not a post that’s evolving so that the comments are now addressing an earlier version. Publishing a second version that includes much of the first is a great idea.
I honestly agree with this. The post has been edited so much that I think it’s time to delete it and upload a new version, so that I can actually incorporate the edits cleanly instead of leaving this weird patchwork of a post.
I’d choose a different term than white box, as per Steve Byrnes’ conclusion that he just won’t use those terms since they’re confusing.
Yeah, I’ll probably edit it to emphasize something else.
My biggest substantive comment is that you seem to be assuming that because we could get alignment right, we will get alignment right. Even Yudkowsky agrees that we could get it right.
I am definitely assuming that, but I think it’s a weak assumption, provided at least some part of my post holds up. In essence, I’m hoping that OpenAI won’t do the worst possible thing, even when avoiding it isn’t favored by profit incentives.
You’re arguing that it’s a lot easier than assumed, and I think that’s probably right. But that’s not enough to be confident that we will get it right. It will depend on how seriously the first person to make self-improving AGI takes alignment, even if there are easy techniques available. Will they use them, or will they race and take risks?
The good news is that if value learning is easy, we have an easier time, since AI regulation can work a lot more like normal regulation; in particular, licensing doesn’t need to be that strict. Don’t get me wrong, AI governance is still necessary in this world, but the type of governance would be drastically different. No pauses, for one example.
Agreed on all points. This is closely related to my thinking on how we survive, which is why I care about seeing it presented in a way people can hear and understand. I’ll send you a draft of the closely related post I’m working on. And if you haven’t seen it, I focus on that last point, value learning being relatively easy, in this post: The (partial) fallacy of dumb superintelligence.
I think it’s worth explicitly discussing the assumption that people won’t do “the dumbest possible thing”. It’s a reasonable assumption, but it’s probably a little more complicated than that. If alignment taxes are non-zero, there will be some pull between different motivations.
I think it’s worth explicitly discussing the assumption that people won’t do “the dumbest possible thing”. It’s a reasonable assumption, but it’s probably a little more complicated than that. If alignment taxes are non-zero, there will be some pull between different motivations.
Yeah, it depends on how small the alignment tax is. If it’s not zero, as I unfortunately suspect, but merely small, then there is still a small chance of extinction risk. I definitely plan to discuss that when I delete the post and reupload it.
Thanks for talking with me today!