Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.
The fundamental example of this is probably optimizability: is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good) without just ending up in the equivalent of DeepDream’s pictures of Maximum Dog?