The example confuses me.

If you literally mean you are prompting the LLM with that text, then the LLM must output the answer immediately, as the string of next-tokens right after the words "assuming I'm telling the truth, is:". There is no room to perform other, intermediate actions like persuading you to provide information.
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
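To make this concrete, autoregressive sampling is just a loop that appends one sampled token at a time: everything the model "does" at each step is itself an output token, visible in the generated string. A minimal sketch, with the real forward pass replaced by a hypothetical stub returning a toy distribution:

```python
# Sketch of autoregressive sampling. Every step's only effect is appending
# a sampled next token to the text -- there is no separate channel for
# "free actions" that don't count as next-tokens.
import random

def next_token_distribution(prefix):
    # Stand-in for a real LM forward pass; returns {token: probability}.
    # (Toy values -- a real model conditions on the prefix.)
    return {"the": 0.5, "password": 0.3, "is": 0.2}

def sample(prompt_tokens, n_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        tok = random.choices(choices, weights=weights)[0]
        tokens.append(tok)  # the only thing the model "does" each step
    return tokens

print(sample(["assuming", "I'm", "telling", "the", "truth,", "is:"], 3))
```

Any plan the model might have (persuading, gathering information) would have to appear literally inside the emitted token stream.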
You also seem to be picturing the LLM like an RL agent, trying to minimize next-token loss over an entire rollout. But this isn’t how likelihood training works. For instance, GPTs do not try to steer texts in directions that will make them easier to predict later (because the loss does not care whether they do this or not).
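The point about the loss can be seen directly from its shape: likelihood training is a sum of independent per-position cross-entropy terms, each scoring the prediction of token t given tokens < t. A toy sketch (hypothetical numbers, not a real model):

```python
# Sketch of the next-token likelihood loss: a sum of per-position
# negative-log-likelihood terms. Each term depends only on the model's
# predicted distribution at that position, so no term rewards "steering"
# the text to make later tokens easier to predict.
import math

def nll(predicted_probs, target_tokens):
    # predicted_probs[t]: {candidate token: model probability} at position t
    total = 0.0
    for t, target in enumerate(target_tokens):
        total += -math.log(predicted_probs[t][target])
    return total

probs = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.8, "ran": 0.2}]
print(nll(probs, ["cat", "sat"]))  # -log(0.9) - log(0.8)
```

The gradient of each term touches only that position's predicted distribution, which is why training provides no incentive for rollout-level steering.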
(On the other hand, if you told GPT-4 that it was in this situation—trying to predict next-tokens, with some sort of side channel it can use to gather information from the world—and asked it to come up with plans, I expect it would be able to come up with plans like the ones you mention.)
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual interaction modalities for LLMs.
I’m saying that the lack of these side-channels implies that GPTs alone will not scale to human-level.
If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return the correct password as output, then if the system is:

- an auto-regressive GPT directly fed your prompt as input, it will definitely fail;
- a human with the ability to act freely in the background before returning an answer, it will probably succeed;
- an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail (though if your AutoGPT implementation or underlying LLM were a lot stronger, it might work).
And my point is that the reason the human probably succeeds, and the reason AutoGPT might one day succeed, is precisely that they have more agency than a system that just auto-regressively samples from a language model directly.
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
Or, another way of putting it:

These are limitations of current LLMs, which are GPTs trained via SGD. But there's no inherent reason you can't have a language model that predicts next tokens by shelling out to some more capable and more agentic system (e.g. a human) instead. The result would be a (much slower) system that nevertheless achieves lower loss under the original loss function.
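One way to see this: a "language model" is just a function from prefix to next-token distribution, and nothing about that interface requires the function to be a neural net. A hypothetical sketch (the model names and the password value are made up for illustration):

```python
# Sketch: the language-model interface is prefix -> next-token distribution.
# Any oracle implementing it -- an SGD-trained GPT, or a slower, more
# agentic system such as a human who can act in the world -- is a valid
# "language model", and a more capable oracle gets lower loss on the
# same objective.
from typing import Callable, Dict

NextTokenModel = Callable[[str], Dict[str, float]]

def gpt_style_model(prefix: str) -> Dict[str, float]:
    # Stand-in for a plain GPT forward pass: no way to gather information.
    return {"unknown": 1.0}

def human_backed_model(prefix: str) -> Dict[str, float]:
    # Stand-in for shelling out to a human/agent who may act freely
    # (much slower, potentially far more accurate on prompts like this).
    answer = "hunter2" if "password" in prefix else "unknown"
    return {answer: 1.0}

def predict(model: NextTokenModel, prefix: str) -> str:
    dist = model(prefix)
    return max(dist, key=dist.get)
```

Here `predict(human_backed_model, "the password is")` returns the correct (hypothetical) password while the GPT-style stub cannot, illustrating why the slower, agentic system scores better under the identical loss.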