Nice gradings. I was confused about this one, though:
Narrow reward modeling + transparency tools: High TOW, Medium WIT.
Why is this the most trustworthy thing on the list? To me it feels like one of the least trustworthy, because it’s one of the few where we haven’t done recursive safety checks (especially given lack of adversarial training).
Basically because I think that amplification/recursion, as it's currently conceived, is more trouble than it's worth. It will produce outputs with high fitness according to whatever selection process is applied, which in the limit will be bad.
On the other hand, you might see this as me claiming that “narrow reward modeling” includes a lot of important unsolved problems. HCH is well-specified enough that you can talk about doing it with current technology. But fulfilling the verbal description of narrow value learning requires some advances in modeling the real world (unless you literally treat the world as a POMDP and humans as Boltzmann-rational agents, in which case we’re back down to bad computational properties and also bad safety properties), which gives me the wiggle room to be hopeful.
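For concreteness, the "humans as Boltzmann-rational agents" assumption mentioned above models the human as choosing actions with probability proportional to the exponentiated value of each action. A minimal sketch (the function name and the toy Q-values are illustrative, not from any particular codebase):

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """P(a) proportional to exp(beta * Q(a)): a noisily-rational human model.

    beta -> 0 gives a uniform (fully noisy) chooser;
    beta -> infinity recovers a perfectly rational maximizer.
    """
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: three actions with values 1.0, 2.0, 0.5.
# Higher-value actions get higher probability, but mistakes remain possible.
print(boltzmann_policy([1.0, 2.0, 0.5], beta=2.0))
```

The point in the text is that this model is tractable to write down, but baking it in literally inherits both its computational cost (planning in a POMDP) and its misspecification of real human behavior.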