(Extremely speculative comment, please tell me if this is nonsense.)
If it makes sense to differentiate the “Thought Generator” and “Thought Assessor” as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?
My first reaction to this is “obviously not, since the architecture is completely different, so why would they map onto each other?”, but a possible answer could be “well, if the brain has them as separate modules, that could mean the two tasks require different solutions; and if one is much harder than the other, and the harder one is the assessor, that could mean language models would naturally solve just the generation part first”.
The related thing that I find interesting is that, a priori, it’s not at all obvious that you’d have these two different modules at all (since the thought generator already receives ground truth feedback). Does this mean the distinction is deeply meaningful? Well, that depends on how close to optimal the [design of the human brain] is.
Hmm. An algorithm trained to reproduce human output is presumably being trained to imitate the input-output behavior of the whole system including Thought Generator and Thought Assessor and Steering Subsystem.
I’m trying to imagine deleting the Thought Assessor & Steering Subsystem, and replacing them with a constant positive RPE (i.e., “this is good, keep going”) signal. I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.
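(As a concrete illustration of that thought experiment, here is a minimal toy sketch, not from the original discussion; the vocabulary, function names, and scoring rule are all made up. The point is just that when the assessor is replaced by a constant “this is good, keep going” signal, nothing is ever filtered out.)

```python
import random

# Toy vocabulary of "thoughts" the generator can propose (purely illustrative).
CANDIDATE_THOUGHTS = ["say something polite", "blurt out a secret",
                      "make a wild plan", "stay quiet"]

def thought_generator():
    """Stand-in for the learned generator: proposes a candidate thought."""
    return random.choice(CANDIDATE_THOUGHTS)

def thought_assessor(thought):
    """Stand-in for the learned assessor: scores how good a thought seems."""
    return -1.0 if ("secret" in thought or "wild" in thought) else +1.0

def constant_positive_rpe(thought):
    """The thought experiment: assessor deleted, every thought gets 'keep going'."""
    return +1.0

def run_agent(assess, steps=10):
    """Generate thoughts; only act on the ones the assessor endorses."""
    acted_on = []
    for _ in range(steps):
        thought = thought_generator()
        if assess(thought) > 0:   # steering: suppress negatively-assessed thoughts
            acted_on.append(thought)
    return acted_on

print("with assessor:   ", run_agent(thought_assessor))
print("assessor removed:", run_agent(constant_positive_rpe))  # nothing is ever suppressed
```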
I think having the two separate modules is a necessary design feature for good performance. Think of it this way. I’m evolution. Here are two tasks I want to solve:
(A) estimate how much a particular course-of-action advances inclusive genetic fitness,
(B) find courses of action that get a good score according to (A).
(B) obviously benefits from incorporating a learning algorithm. And if you think about it, (A) benefits from incorporating a learning algorithm as well. But the learning algorithms involved in (A) and (B) are fundamentally working at cross-purposes. (A) is the critic, (B) is the actor. They need separate training signals and separate update rules. If you try to blend them together in an end-to-end way, you just get wireheading. (I.e., if the (B) learning algorithm had free rein to update the parameters in (A), using the same update rule as (B), then it would immediately set (A) to return infinity all the time, i.e. wirehead.) (Humans can wirehead to some extent (see Post #9), but we need to explain why it doesn’t happen universally and permanently within five minutes of birth.)
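(To make the cross-purposes point concrete, here is a toy actor-critic sketch. It is my own illustration, not anything from the post; the three-armed-bandit setup, the array names, and the learning rates are all invented. With separate update rules, the critic tracks ground truth and the actor learns to pick good actions; if the actor’s objective is instead allowed to update the critic’s parameters, the updates just inflate the critic’s output without bound, i.e. wireheading.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = np.array([0.1, 0.9, 0.3])  # ground-truth value of 3 actions (stand-in for fitness)
critic     = np.zeros(3)                # (A): learned estimate of each action's value
actor_pref = np.zeros(3)                # (B): learned preferences over actions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for step in range(2000):
    probs = softmax(actor_pref)
    a = rng.choice(3, p=probs)

    # Critic update: its OWN rule, anchored to ground truth (a reward prediction error).
    reward = true_value[a] + 0.05 * rng.standard_normal()
    critic[a] += lr * (reward - critic[a])

    # Actor update: its OWN rule, pushed toward actions the critic scores above average.
    actor_pref[a] += lr * (critic[a] - critic @ probs)

print("critic estimates:", critic.round(2))           # roughly [0.1, 0.9, 0.3]
print("actor policy:    ", softmax(actor_pref).round(2))  # mostly concentrated on action 1

# Failure mode: give the actor's objective free rein over the critic's parameters.
# Pushing the critic's outputs up always increases the "score" the actor receives,
# so unconstrained updates just inflate the estimates forever ("return infinity").
for step in range(2000):
    critic += lr * np.ones(3)
print("wireheaded critic:", critic.round(1))  # grows without bound, decoupled from ground truth
```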
What do you picture a language model with no inhibitions looking like? Because if I try to imagine it, then “something that outputs reasonable-sounding text until sooner or later it fails hard” seems to be a decent fit. Of course, I haven’t thought much about the generator/assessor distinction.
I mean, surely the “inhibitions” of a language model don’t map onto human inhibitions, right? Like, a language model without the assessor module (or with a much worse assessor module) is just as likely to imitate someone who sounds unrealistically careful as someone who has no restraints.
I find your last paragraph convincing, but that of course makes me put more credence in the theory rather than less.
An analogy that comes to mind is sociopathy, which is closely linked to fear/reward insensitivity and impulsivity. Something you see a lot in case studies of diagnosed sociopaths, or in accounts of people who look obviously like sociopaths, is that they will be going along just fine, seeming very competent and intelligent, getting away with everything, until they suddenly do something which is just reckless, pointless, useless, and which no sane person could possibly think they’d get away with. Why did they do X, which caused the whole house of cards to come tumbling down and is why you are now reading this book or longform investigative piece about them? No reason. They just sorta felt like it. The impulse just came to them. Like jumping off a bridge.
Huh. I would have invoked a different disorder.
I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what’s going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what’s going on)”, the result is a depressive episode.
In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you’re thinking right now is a great thought! Whatever you’re planning is an awesome plan! Go forth and carry that plan out with gusto!!!!” And the depressive episode would be kinda like the brainstem saying “Whatever thought you’re thinking right now is a terrible thought. Stop thinking that thought! Think about anything else! Heck, think about nothing whatsoever! Please, anything but that thought!”
My thoughts about sociopathy are here. Sociopaths can be impulsive (like everyone), but it doesn’t strike me as a central characteristic, as it is in mania. I think there might sometimes be situations where a sociopath does X, and onlookers characterize it as impulsive, but in fact it’s just what the sociopath wanted to do, all things considered, stemming from different preferences / different reward function. For example, my impression is that sociopaths get very bored very easily, and will do something that seems crazy and inexplicable from a neurotypical perspective, but seems a good way to alleviate boredom from their own perspective.
(Epistemic status: Very much not an expert on mania or depression, I’ve just read a couple of papers. I’ve read a larger number of books and papers on sociopathy / psychopathy (which I think are synonyms?), plus there were two sociopaths in my life that I got to know reasonably well, unfortunately. More of my comments about depression are here.)
Do you think this describes language models?
‘Insensitivity to reward or punishment’ does sound relevant...
Yes, but I didn’t mean to ask whether it’s relevant, I meant to ask whether it’s accurate. Does the output of language models, in fact, feel like this? Seemed like something relevant to ask you since you’ve seen lots of text completions.
And if it does, what is the reason for not having long timelines? If neural networks have only solved the easy part of the problem, that implies they’re a much smaller step toward AGI than many have recently argued.
I said it was an analogy. You were discussing what intelligent, human-level entities with inhibition-control problems would hypothetically look like; well, as it happens, we do have such entities, in the form of sociopaths, and as it happens, they do not simply explode in every direction due to lacking inhibitions, but often perform at high levels, manipulating other humans, until suddenly they explode. This is a proof of concept that you can naturally get such streaky performance without any kind of exotic setup or design. Seems relevant to mention.