I’m pretty confused about the plan to use ELK to solve outer alignment. If Cakey is never actually trained, how would amplified humans access its world model?
“To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...
Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.
Avoid the problem of the most helpful instructions being opaque (e.g. “Run this physics simulation, it’s great”) by solving ELK — i.e., finding a mapping from whatever possibly-opaque model of the world happens to be most useful for making superhumanly delicious cakes to concepts humans care about like “people” being “alive.”
Spell out a procedure for scoring predicted futures that could be followed by an amplified human who has access to a) Cakey’s great world model, and b) the correspondence between it and human concepts of interest. We think this procedure should choose scores using some heuristic along the lines of “make sure humans are safe, preserve option value, and ultimately defer to future humans about what outcomes to achieve in the world” (we go into much more detail in Appendix: indirect normativity).
Distill their scores into a reward model that we use to train Hopefully-Aligned-Cakey, which hopefully uses its powers to help humans build the utopia we want.”
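To check that I’m reading the last two steps correctly, here is a minimal toy sketch of the “score predicted futures, then distill the scores into a reward model” part of the plan. Everything in it is a stand-in of my own: `featurize`, `amplified_human_score`, the hand-picked weights, and the least-squares “reward model” are hypothetical placeholders, not anything specified in the report.

```python
# Toy sketch (my own stand-ins, not from the ELK report): score some predicted
# futures with an "amplified human" procedure, then distill those scores into a
# cheap reward model that could in principle supervise further training.
import numpy as np

rng = np.random.default_rng(0)

def featurize(future):
    # Stand-in for "Cakey's world model + the ELK correspondence": map a predicted
    # future to human-interpretable quantities like "people are safe" or "option value".
    return np.array([future["people_safe"], future["option_value"], future["cake_quality"]])

def amplified_human_score(future):
    # Stand-in for the indirect-normativity scoring procedure: weight safety and
    # option value far above cake quality. The weights are purely illustrative.
    return 10.0 * future["people_safe"] + 5.0 * future["option_value"] + 1.0 * future["cake_quality"]

# Sample some predicted futures and have the (stand-in) amplified human score them.
futures = [
    {"people_safe": rng.random(), "option_value": rng.random(), "cake_quality": rng.random()}
    for _ in range(200)
]
X = np.stack([featurize(f) for f in futures])
y = np.array([amplified_human_score(f) for f in futures])

# "Distill" the scores into a simple linear reward model (least squares here,
# standing in for whatever learned reward model would actually be trained).
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

def reward_model(future):
    return featurize(future) @ weights

# The distilled reward model can now score new predicted futures cheaply, which is
# the signal that would be used to train Hopefully-Aligned-Cakey.
print(reward_model({"people_safe": 1.0, "option_value": 0.8, "cake_quality": 0.2}))
```

My confusion is about the inputs to this pipeline: both `featurize` (Cakey’s world model plus the ELK mapping) and the amplified human’s access to that model seem to presuppose the very artifact the plan says we avoid training.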