You’re asking about pure predictive (a.k.a. self-supervised) learning. As far as I know, it’s an open question what the safety issues are for that (if any), even in a very concrete case like “this particular Transformer architecture trained on this particular dataset using SGD”. I spent a bit of time last summer thinking about it, but didn’t get very far. See my post “Self-supervised learning & manipulative predictions” for one possible failure mode that I wasn’t able to either confirm or rule out. (I should go back to it at some point.) See also my post “Self-Supervised Learning and AGI Safety” for everything else I know on the topic. And of course I must mention Abram’s delightful “The Parable of Predict-O-Matic” if you haven’t already seen it. Again, that is high-level speculation that might or might not apply to any particular concrete system (“this particular Transformer architecture trained by SGD”). Lots of open questions!
Another set of potential problems comes from your suggestion to put it in a robot body and actually execute the commands. Can it even walk? Of course it could learn to walk by trial and error with a reward signal, but then we’re no longer talking about pure predictive learning. Hmm, after thinking about it, I guess I’m cautiously optimistic that, in the limit of infinite training data from infinitely many robots learning to walk, a large enough Transformer doing predictive learning could learn to read its own sense data and walk without any reward signal. But then how do you get it to do useful things? My suggestion here would be to put a metadata flag on the training inputs where a robot is being super-helpful, and then, when you have the robot start acting in the real world, turn that flag on. Now we’re bringing in supervised learning, I guess.
If the robot were actually capable of doing anything at all, I would be very concerned that you press go, and then the system wanders farther and farther out of distribution and does weird, dangerous things that have a high impact on the world.
As for concrete advice for the GPT-7 team: I would suggest at least throwing out the robot body, making a text / image prediction system in a box, and putting a human in the loop who looks at the screen before anything goes out and gets done in the world. This can still be very powerful and economically useful, and it’s a step in the right direction: it eliminates the problem of the system just going off and doing something weird and high-impact in the world because it wandered out of distribution. It doesn’t eliminate all problems, because the system might still become manipulative. As I mentioned in the first paragraph, I don’t know whether that’s a real problem or not; more research is needed. It’s possible that we’re all just doomed in your scenario. :-)
Thanks!