Take a big language model like GPT-3, then train it via RL on tasks where it is given a language instruction from a human, and it gets a reward if the human judges that it has done the task successfully.
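That loop can be sketched in miniature. This is a toy illustration, not the actual training setup: the two canned responses, the `human_reward` stand-in for a human judgment, and the REINFORCE-style weight update are all hypothetical simplifications of "RL with reward from human approval."

```python
import random

random.seed(0)

# Hypothetical toy "policy": unnormalized preference weights over
# two canned responses (a stand-in for a language model's outputs).
responses = ["ignore the instruction", "follow the instruction"]
weights = [1.0, 1.0]

def human_reward(response):
    # Stand-in for a human judging task success: reward 1 if the
    # model followed the instruction, else 0.
    return 1.0 if response == "follow the instruction" else 0.0

def sample(weights):
    # Sample an index in proportion to its weight.
    total = sum(weights)
    r = random.uniform(0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# RL loop: sample a response, get the human's reward, and reinforce
# the sampled response in proportion to that reward.
for _ in range(200):
    i = sample(weights)
    weights[i] += 0.1 * human_reward(responses[i])

best = responses[max(range(len(responses)), key=lambda i: weights[i])]
print(best)
```

Only the rewarded response's weight ever grows, so over many rounds the policy shifts toward behavior the human approves of, which is the core of the setup described above.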
Makes sense, thanks!