Nice gradings. I was confused about this one, though:
Narrow reward modeling + transparency tools: High TOW, Medium WIT.
Why is this the most trustworthy thing on the list? To me it feels like one of the least trustworthy, because it’s one of the few where we haven’t done recursive safety checks (especially given lack of adversarial training).
Basically because I think that amplification/recursion, as it's currently conceived, is more trouble than it's worth. It will produce outputs with high fitness according to whatever selection process is applied, which in the limit will be bad.
On the other hand, you might see this as me claiming that “narrow reward modeling” includes a lot of important unsolved problems. HCH is well-specified enough that you can talk about doing it with current technology. But fulfilling the verbal description of narrow value learning requires some advances in modeling the real world (unless you literally treat the world as a POMDP and humans as Boltzmann-rational agents, in which case we’re back down to bad computational properties and also bad safety properties), which gives me the wiggle room to be hopeful.
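For concreteness, the "humans as Boltzmann-rational agents" assumption mentioned above models the human as choosing actions with probability proportional to the exponentiated value of each action. A minimal sketch (the function name and the toy Q-values are illustrative, not from any particular codebase):

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """P(a) proportional to exp(beta * Q(a)): a noisily-rational human model.

    beta -> 0 gives a uniform (fully noisy) chooser;
    beta -> infinity recovers a perfectly rational maximizer.
    """
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: three actions with values 1.0, 2.0, 0.5.
# Higher-value actions get higher probability, but mistakes remain possible.
print(boltzmann_policy([1.0, 2.0, 0.5], beta=2.0))
```

The point in the text is that this model is tractable to write down, but baking it in literally inherits both its computational cost (planning in a POMDP) and its misspecification of real human behavior.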