TLW comments on [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

TLW 2 Mar 2022 2:32 UTC
3 points
AF
Just to make sure we’re on the same page, I made up the “300ms” number, it could be something else.
Sure; the further you get away from ~300ms the less the number makes sense for e.g. predicting neuron latency, as described earlier.
Also to make sure we’re on the same page, I claim that from a design perspective, fast oscillation instabilities are bad, and from an introspective perspective, fast oscillation instabilities don’t happen. (I don’t have goosebumps, then 150ms later I don’t have goosebumps, then 150ms later I do have goosebumps, etc.)
I absolutely agree that most of the time oscillations don’t happen. That being said, oscillations absolutely do happen in at least one case: epilepsy. I remain puzzled that evolution “allows” epilepsy to happen. Treating epilepsy as a breakdown that does allow ~300ms oscillations, akin to feedback in audio amplifiers, is a better explanation for this than any I’ve heard elsewhere.
Sorry, I’m confused. There’s an I and a D? I only see a P.
A generic overdamped PID controller will react to a step-change in its input via (vaguely)-exponential decay towards the new value^[1].
Even for a non-overdamped PID controller the magnitude of the tail decreases exponentially with time. (So long as said PID controller is stable at least.)
You are correct that all that is necessary for a PID controller to react in this fashion is a nonzero P term.
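As a minimal sketch of the P-only case (assuming, purely for illustration, a loop whose output obeys y' = kp * (target - y), stepped with forward Euler; the function name and the parameter values are made up for this example), the remaining error shrinks by the same factor every step, which is exactly the exponential decay towards the new value:

```python
def p_only_step_response(kp: float, dt: float, steps: int, target: float = 1.0) -> list[float]:
    """Simulate y' = kp * (target - y) with forward Euler, starting from y = 0.

    This is a proportional-only loop (no I or D term); the step change is the
    jump of the target from 0 to `target` at time zero.
    """
    y = 0.0
    trace = []
    for _ in range(steps):
        y += dt * kp * (target - y)  # P term only
        trace.append(y)
    return trace


trace = p_only_step_response(kp=2.0, dt=0.01, steps=500)
errors = [1.0 - y for y in trace]
# Ratio of successive errors; a constant ratio means geometric (exponential) decay.
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
print(f"first ratio = {ratios[0]:.6f}, last ratio = {ratios[-1]:.6f}")
```

Each ratio equals 1 - kp * dt, so the error tail is a pure exponential with no contribution needed from an I or D term.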
It seems to me that you can start a startle reaction quickly (small fraction of a second), but you can’t stop a startle quickly.
Absolutely; a step change followed by a decay still has high-frequency components. (This is the same thing people forget when they route ‘slow’ clocks with fast drivers and then wonder why they are getting crosstalk on other signals and high-frequency interference in general.)
Your slow-responding predictor is going to have a terrible effective reaction time, is what I’m trying to say here, because you’re filtering out the high-frequency components of the prediction error, and so the rising edge of your prediction error gets filtered from a step change to something closer to a sigmoid that takes quite a while to get to full amplitude.… which in turn means that what the predictor learns is not a step-change followed by a decay. It learns the output of a low-pass filter on said step-change followed by a decay, a.k.a. a slow rise and decay.
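To make that concrete, here is a rough sketch that passes a step-change-followed-by-decay through a first-order low-pass filter. The time constants (0.5 s decay, 0.3 s filter) and the function names are assumptions chosen for illustration, not numbers from the post:

```python
import math


def step_then_decay(n: int, dt: float, tau_decay: float) -> list[float]:
    """Signal that jumps from 0 to 1 at t = 0 and then decays exponentially."""
    return [math.exp(-i * dt / tau_decay) for i in range(n)]


def low_pass(signal: list[float], dt: float, tau_filter: float) -> list[float]:
    """First-order (single-pole) low-pass filter, starting from an output of 0."""
    alpha = dt / (tau_filter + dt)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


dt = 0.001
raw = step_then_decay(3000, dt, tau_decay=0.5)
filtered = low_pass(raw, dt, tau_filter=0.3)

raw_peak_time = raw.index(max(raw)) * dt
filtered_peak_time = filtered.index(max(filtered)) * dt
print(f"raw peak      {max(raw):.2f} at t = {raw_peak_time:.3f} s")
print(f"filtered peak {max(filtered):.2f} at t = {filtered_peak_time:.3f} s")
```

The raw signal peaks immediately at full amplitude, while the filtered one rises slowly, peaks later, and never reaches the original height. A predictor trained on the filtered signal would be learning that slow rise and decay.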
I also want to stay very very far away from any situation where there might be fast oscillations that originate in instability rather than already being present in exogenous data.
Right. Which brings me back to my puzzle: why does epilepsy continue to exist?

(Do you at least agree that, were there some mechanism where there was enough feedback/crosstalk such that you did get oscillations, it might look something like epilepsy?)
And I continue to believe that these things are all compatible
Can you please give an example of a general-purpose function estimator, that when plugged into this pseudo-TD system, both:
1. Can learn “most^[2]” functions
2. Has a low-and-bounded learning rate regardless of current parameters, such that $| \frac{dOut}{dFeedback} | < 1$ (after a single update, that is).
I know of schemes that achieve 1, and schemes that achieve 2. I don’t know of any schemes that achieve both offhand^[3].
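As a toy illustration of why condition 2 is hard to guarantee “regardless of current parameters”, consider a hypothetical two-parameter unit `out = a * LReLU(b * x)` trained by a plain gradient step. This model, the learning rate, and the parameter values are all assumptions made up for this sketch:

```python
def leaky_relu(z: float, slope: float = 0.01) -> float:
    return z if z > 0 else slope * z


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    return 1.0 if z > 0 else slope


def output(a: float, b: float, x: float) -> float:
    return a * leaky_relu(b * x)


def sensitivity_after_one_update(a: float, b: float, x: float, lr: float,
                                 feedback: float = 1e-3) -> float:
    """Change in output per unit of feedback after a single gradient step.

    The step moves each parameter by lr * feedback * d(out)/d(parameter).
    """
    z = b * x
    new_a = a + lr * feedback * leaky_relu(z)
    new_b = b + lr * feedback * a * leaky_relu_grad(z) * x
    return (output(new_a, new_b, x) - output(a, b, x)) / feedback


small = sensitivity_after_one_update(a=1.0, b=1.0, x=1.0, lr=0.1)
large = sensitivity_after_one_update(a=10.0, b=10.0, x=1.0, lr=0.1)
print(f"small parameters: {small:.3f}, large parameters: {large:.3f}")
```

To first order the sensitivity is `lr * x**2 * (a**2 + b**2)`, so for any fixed learning rate there are parameter values that push it above 1.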
*****
Thank you again for going back and forth with me on this by the way. I appreciate it.
1. ^
  ...or some offset from the new value, in some cases.
2. ^
  I’m not going to worry too much if e.g. there’s a single unstable pathological case.
3. ^
  LReLU violates 2. LReLU with regularization violates 1. Etc.
- Steven Byrnes 4 Mar 2022 14:48 UTC
  2 points
  AF Parent
  Sure; the further you get away from ~300ms the less the number makes sense for e.g. predicting neuron latency, as described earlier.
  I must have missed that part; can you point more specifically to what you’re referring to?
  why does epilepsy continue to exist?
  I think practically anywhere in the brain, if A connects to B, then it’s a safe bet that B connects to A. (Certainly for regions, and maybe even for individual neurons.) Therefore we have the setup for epileptic seizures, if excitation and inhibition are not properly balanced.
  Or more generically, if X% of neurons in the brain are active at time t, then we want around X% of neurons in the brain to be active at time t+1. That means that we want each upstream neuron firing event to (on average) cause exactly one net downstream neuron to fire. But individual neurons have their own inputs and outputs; by default, there seems to be a natural failure mode where the upstream neurons excite not-exactly-one downstream neuron, and we get exponential growth (or decay).
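A minimal sketch of that failure mode, assuming a deliberately crude model in which expected activity is simply multiplied each time step by a "branching ratio" (the average number of downstream firings caused by each upstream firing; both the name and the numbers are illustrative assumptions):

```python
def expected_activity(initial_active: float, branching_ratio: float, steps: int) -> list[float]:
    """Expected number of active neurons when each firing causes, on average,
    `branching_ratio` downstream firings at the next time step."""
    activity = [initial_active]
    for _ in range(steps):
        activity.append(activity[-1] * branching_ratio)
    return activity


finals = {}
for ratio in (0.95, 1.0, 1.05):
    finals[ratio] = expected_activity(100.0, ratio, steps=100)[-1]
    print(f"branching ratio {ratio}: expected activity after 100 steps = {finals[ratio]:.2f}")
```

Only a ratio of exactly 1 holds activity steady; slightly below it activity dies out, and slightly above it activity grows exponentially.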
  My impression is that there are lots of mechanisms to balance excitation and inhibition (probably different mechanisms in different parts of the brain), and any of those mechanisms can fail. I’m not an epilepsy expert by any means (!!), but at a glance it does seem like epilepsy has a lot of root causes and can originate in lots of different brain areas, including areas that I don’t think are doing this kind of prediction, e.g. temporal lobe and dorsolateral prefrontal cortex and hippocampus.
  the rising edge of your prediction error gets filtered from a step change to something closer to a sigmoid that takes quite a while to get to full amplitude.… which in turn means that what the predictor learns is not a step-change followed by a decay. It learns the output of a low-pass filter on said step-change followed by a decay, a.k.a. a slow rise and decay.
  I still think you’re incorrectly mixing up the time-course of learning (changes to parameters / weights / synapse strengths) with the time-course of an output following a sudden change in input. I think they’re unrelated.
  To clarify our intuitions here, I propose to go to the slow-learning limit.
  However fast you’ve been imagining the parameters / weights / synapse strength changing in any given circumstance, multiply that learning rate by 0.001. And simultaneously imagine that the person experiences everything in their life with 1000× more repetitions. For example, instead of getting whacked by a golf ball once, they get whacked by a golf ball 1000× (on 1000 different days).
  (Assume that the algorithm is exactly the same in every other respect.)
  I claim that, after this transformation (much lower learning rate, but proportionally more repetitions), the learning algorithm will build the exact same trained model, and the person will flinch the same way under the same circumstances.
  (OK, I can imagine it being not literally exactly the same, thanks to the details of the loss landscape and gradient descent etc., but similar.)
  Your perspective, if I understand it, would be that this transformation would make the person flinch more slowly—so slowly that they would get hit by the ball before even starting to flinch.
  If so, I don’t think that’s right.
  Every time the person gets whacked, there’s a little interval of time, let’s say 50ms, wherein the context shows a golf ball flying towards the person’s face, and where the supervisor will shortly declare that the person should have been flinching. That little 50ms interval of time will contribute to updating the synapse strengths. In the slow-learning limit, the update will be proportionally smaller, but OTOH we’ll get that many more repetitions in which the same update will happen. It should cancel out, and it will eventually converge to a good prediction, F(ball-flying-towards-my-face) = I-should-flinch.
  And after training, even if we lower the learning rate all the way down to zero, we can still get fast flinching at appropriate times. It would only be a problem if the person changes hobbies from golf to swimming—they wouldn’t learn the new set of flinch cues.
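Here is a rough sketch of the slow-learning limit, assuming a toy linear predictor trained with a delta rule on two hand-made contexts. The feature encoding, learning rates, and repetition counts are illustrative assumptions, not a model of the actual circuit:

```python
BALL = (1.0, 0.0)      # context: golf ball flying towards my face
NO_BALL = (0.0, 1.0)   # context: nothing incoming
EXAMPLES = [(BALL, 1.0), (NO_BALL, 0.0)]  # (context, supervisor says "should flinch")


def predict(weights: list[float], context: tuple[float, float]) -> float:
    """The output depends only on the current context and the current weights."""
    return sum(w * c for w, c in zip(weights, context))


def train(lr: float, repetitions: int) -> list[float]:
    """Delta-rule training: nudge each weight by lr * error * input."""
    weights = [0.0, 0.0]
    for _ in range(repetitions):
        for context, target in EXAMPLES:
            error = target - predict(weights, context)
            weights = [w + lr * error * c for w, c in zip(weights, context)]
    return weights


fast = train(lr=0.1, repetitions=100)
slow = train(lr=0.0001, repetitions=100_000)  # 0.001x the rate, 1000x the repetitions

print(f"fast learner, ball:    {predict(fast, BALL):.4f}")
print(f"slow learner, ball:    {predict(slow, BALL):.4f}")
print(f"slow learner, no ball: {predict(slow, NO_BALL):.4f}")
```

Both learners end up with nearly the same weights, and `predict` responds to the ball context in a single evaluation either way; the learning rate only controls how many repetitions it took to get there.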
  Sorry if I’m misunderstanding where you’re coming from.
  Can you please give an example of a general-purpose function estimator, that when plugged into this pseudo-TD system, both:
  Can learn “most^[2]” functions
  Has a low-and-bounded learning rate regardless of current parameters, such that $| \frac{dOut}{dFeedback} | < 1$ (after a single update, that is).
  If you take any solution to 1, and multiply the learning rate by 0.000001, then it would satisfy 2 as well, right?
  - TLW 6 Mar 2022 5:24 UTC
    3 points
    AF Parent
    I must have missed that part; can you point more specifically to what you’re referring to?
    It feels wrong to refer you back to your own writing, but much of part 4 was dedicated to talking about these short-term predictors being used to combat neural latency and to do… well, short-term predictions. A flinch detector that goes off 100ms in advance is far less useful than a flinch detector that goes off 300ms in advance, but at the same time a short-term predictor that predicts too far in advance leads to feedback when used as a latency counter (as I asked about/noted in the previous post).
    (It’s entirely possible that different predictors have different prediction timescales… but then you’ve just replaced the problem with a meta-problem. Namely: how do predictors choose the timescale?)
    To clarify our intuitions here, I propose to go to the slow-learning limit.
    However fast you’ve been imagining the parameters / weights / synapse strength changing in any given circumstance, multiply that learning rate by 0.001. And simultaneously imagine that the person experiences everything in their life with 1000× more repetitions. For example, instead of getting whacked by a golf ball once, they get whacked by a golf ball 1000× (on 1000 different days).
    1x the training data with 1x the training rate is not equivalent to 1000x the training data with 1/1000th of the training rate. Nowhere near. The former is a much harder problem, generally speaking.
    (And in a system as complex and chaotic as a human there is no such thing as repeating the same datapoint multiple times… related data points yes. Not the same data point.)
    (That being said, 1x the training data with 1x the training rate is still harder than 1x the training data with 1/1000th the training rate, repeated 1000x.)
    Your perspective, if I understand it, would be that this transformation would make the person flinch more slowly—so slowly that they would get hit by the ball before even starting to flinch.
    You appear to be conflating two things here. It’s worth calling them out as separate.
    Putting a low-pass filter on the learning feedback signal absolutely does cause something to learn a low-passed version of the output. Your statement “In that case, the circuit would be basically incapable of “fast” dynamics (i.e. it would have implicit low-pass filters everywhere),” doesn’t really work, precisely because it leads to absurd conclusions. This is what I was calling out.
    A low learning rate is something different. (That has other problems...)
    If you take any solution to 1, and multiply the learning rate by 0.000001, then it would satisfy 2 as well, right?
    My apologies, and you are correct as stated; I should have added something on few-shot learning. Something like a flinch detector likely does not fire 1,000,000x in a human lifetime^[1], which means that your slow-learning solution hasn’t learnt anything significant by the time the human dies, and isn’t really a solution.
    I am aware that the 1,000,000 is likely just you hitting ‘0’ a bunch of times, but humans are great few-shot (and even one-shot) learners. You can’t just drop the training rate, or else your examples like ‘just stand on the ladder for a few minutes and your predictor will make a major update’ don’t work.
    ^
    My flinch reflex works fine and I’d put a trivial upper-bound of 10k total flinches (probably even 1k is too high). (I lead a relatively quiet life.)
    - Steven Byrnes 7 Mar 2022 20:51 UTC
      2 points
      AF Parent
      Oh, hmm. In my head, the short-term predictors in the cerebellum are for latency-reduction and were discussed in the last post, and meanwhile the short-term predictors in the telencephalon (amygdala & mPFC) are for flinching and are discussed here. I think the cerebellum short-term predictors and the telencephalon short-term predictors are built differently for different purposes, and once we zoom in beyond the idea of “short-term prediction” and start talking about parameter settings etc., I really don’t lump them together in my mind; they’re apples and oranges. In the conversation thus far, I thought you were talking about the telencephalon (amygdala & mPFC) ones. If we’re talking about instability from the cerebellum instead, we can continue the Post #4 thread.
      ~
      I think I said some things about low-pass filters up-thread and then retracted them later on, and maybe you missed that. At least for some of the amygdala functions like flinching, I agree with you that low-pass filters seem unlikely to be part of the circuit (well, depending on where the frequency cutoff is, I suppose). Sorry, my bad.
      ~
      A common trope is that the hippocampus does one-shot learning in a way that vaguely resembles a lookup table with auto-associative recall, whereas other parts of the cortex learn more generalizable patterns more slowly, including via memory recall (i.e., gradual transfer of information from hippocampus to cortex). I’m not immediately sure whether the amygdala does one-shot learning. I do recall a claim that part of PFC can do one-shot learning, but I forget which part; it might have been a different part than we’re talking about. (And I’m not sure if the claim is true anyway.) Also, as I said before, with continuous-time systems, “one shot learning” is hard to pin down; if David Burns spends 3 seconds on the ladder feeling relaxed, before climbing down, that’s kinda one-shot in an intuitive sense, but it still allows the timescale of synapse changes to be much slower than the timescale of the circuit. Another consideration is that (I think) a synapse can get flagged quickly as “To do: make this synapse stronger / weaker / active / inactive / whatever”, and then it takes 20 minutes or whatever for the new proteins to actually be synthesized etc. so that the change really happens. So that’s “one-shot learning” in a sense, but doesn’t necessarily have the same short-term instabilities, I’d think.