Steven Byrnes comments on Buck’s Shortform

Steven Byrnes 25 Aug 2021 3:22 UTC
LW: 7 AF: 4
I wonder what you mean by “competitive”? Let’s talk about the “alignment tax” framing. One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an “alignment tax” of 0%. The other extreme is an alignment tax of 100%—we know how to make unsafe AGIs but we don’t know how to make safe AGIs. (Or more specifically, there are plans / ideas that an unsafe AI could come up with and execute, and a safe AI can’t, not even with extra time/money/compute/whatever.)
I’ve been resigned to the idea that an alignment tax of 0% is a pipe dream: that’s just way too much to hope for, for various seemingly-fundamental reasons, like humans-in-the-loop being slower and more expensive than humans-out-of-the-loop (more discussion here). But we still want to minimize the alignment tax, and we definitely want to avoid the alignment tax being 100%. (And meanwhile, independently, we try to tackle the non-technical problem of ensuring that all the relevant players are always paying the alignment tax.)
I feel like your post makes more sense to me when I replace the word “competitive” with something like “arbitrarily capable” everywhere (or “sufficiently capable” in the bootstrapping approach where we hand off AI alignment research to the early AGIs). I think that’s what you have in mind: you’re worried that these techniques will just hit a capabilities wall, and that beyond it the alignment tax shoots all the way to 100%. Is that fair? Or do you see an alignment tax of even 1% as an “insufficient strategy”?
- Pattern 25 Aug 2021 15:25 UTC
  2 points
  One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an “alignment tax” of 0%.
  I think that was the idea behind ‘oracle AIs’. (Though I’m aware there were arguments against that approach.)
  
  One of the arguments I didn’t see for sorting out the practical details of how to implement them was: “As we get better at this alignment stuff we will reduce the ‘tradeoff’.” (Also, arguably, getting better human feedback improves performance.)