I’m Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at: https://sjbyrnes.com/agi.html
Thanks! No worries! I’m in the process of understanding this stuff too :-P
in the AGI case, we want to make sure the inner objective is as close to the outer objective as possible, whereas in the brain, we want to make sure the outer objective doesn’t corrupt the inner objective.
I’m not sure I agree with the second one. Maybe my later discussion in Mesa-Optimizers vs Steered Optimizers is better:
You try a new food, and find it tastes amazing! This wonderful feeling is your subcortex sending a steering signal up to your neocortex. All of the sudden, a new goal has been installed in your mind: eat this food again! This is not your only goal in life, of course, but it is a goal, and you might use your intelligence to construct elaborate plans in pursuit of that goal, like shopping at a different grocery store so you can buy that food again.
It’s a bit creepy, if you think about it!
“You thought you had a solid identity? Ha!! Fool, you are a puppet! If your neocortex gets dopamine at the right times, all of the sudden you would want entirely different things out of life!”
Yes I do take the perspective of the inner optimizer, but I have mixed feelings about my goals changing over time as a result of the outer layer’s interventions. Like, if I taste a new food and really like it, that changes my goals, but that’s fine, in fact that’s a delightful part of my life. Whereas, if I thought that reading nihilistic philosophers would carry a risk of making me stop caring about the future, I would be reluctant to read nihilistic philosophers. Come to think of it, neither of those is a hypothetical!
Are these ‘patterns’ the same as the generative models?
Yes. Kaj calls (a subset of) them “subagents”, I more typically call them “generative models”, Kurzweil calls them “patterns”, Minsky calls this idea “society of mind”, etc.
And does ‘randomly generated’ mean that, if I learn a new pattern, my neocortex generates a random set of neurons that is then associated with that pattern from that point onward?
Yes, that’s my current belief fwiw, although to be clear, I only think it’s random on a micro-scale. On the large scale, for example, patterns in raw visual inputs are going to be mainly stored in the part of the brain that receives raw visual inputs, etc. etc.
So, if a meditator says that they have mastered mindfulness to the point that they can experience pain without suffering, your explanation of that (provided you believe the claim) would be, they have reprogrammed their neocortex such that it no longer classifies the generally-pain-like signals from the subcortex as pain?
Sure, but maybe a more everyday example would be a runner pushing through towards the finish line while experiencing runner’s high, or a person eating their favorite spicy food, or whatever. It’s still the same sensors in your body sending signals, but in those contexts you probably wouldn’t describe those signals as “I am in pain right now”.
As for valence, I was confused about valence when I wrote this, and it’s possible I’m still confused. But I felt less confused after writing Emotional Valence Vs RL Reward: A Video-Game Analogy. I’m still not sure it’s right—just the other day I was thinking that I should have said “positive reward prediction error” instead of “reward” throughout that article. I’m going back and forth on that, not sure.
Most model-based RL algorithms I’ve seen assume they can evaluate the reward functions in arbitrary states.
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can’t—it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled “model-based RL Atari” and the first hit was this, which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I’m not intimately familiar with the deep RL literature, so I wouldn’t know what’s typical and I’ll take your word for it, but it does seem that both possibilities are out there.
Anyway, I don’t think the neocortex can evaluate the true reward function in arbitrary states, because it’s not a neat mathematical function, it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards—and that’s the whole thing with TD learning and dopamine.
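To make that concrete, here is a minimal sketch of what “learn the reward function by supervised learning from observations of past rewards” could look like. All names are hypothetical, and a linear model stands in for whatever function approximator the brain or MuZero actually uses:

```python
import numpy as np

NUM_FEATURES = 16           # hypothetical size of the state encoding
w = np.zeros(NUM_FEATURES)  # weights of the learned reward model

def predict_reward(state_features):
    # The planner can only query this learned proxy, never the messy
    # ground-truth reward function. state_features: np.array of length
    # NUM_FEATURES.
    return w @ state_features

def observe_step(state_features, observed_reward, lr=0.01):
    # Supervised learning: nudge the model toward the reward that was
    # actually received in this state.
    global w
    error = observed_reward - predict_reward(state_features)
    w += lr * error * state_features
```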
I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.
I don’t see how you solve this problem in general in a sample-efficient manner otherwise.
Well, the rats are trying to do the rewarding thing after zero samples, so I don’t think “sample-efficiency” is quite the right framing.
In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.
Well, I guess you could say that this is still a “normal MDP”, but where “having thoughts” and “having ideas” etc. are part of the state / action space. But anyway, I think that’s a bit different than how most ML people would normally think about things.
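In toy code, the contrast I have in mind is something like this (hypothetical names, obviously):

```python
# Typical ML setup: reward is a function of external state and action.
def reward_typical(state, action):
    return 1.0 if (action == "drink_saltwater" and state["salt_deprived"]) else 0.0

# My proposal for the brain: reward can also depend directly on the
# current "thought", before anything happens in the outside world.
def reward_proposed(state, thought):
    if state["salt_deprived"] and "imagining tasting salt" in thought:
        return 0.5  # rewarded merely for entertaining the plan
    return 0.0
```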
This seems useful and sensible, thanks!
asking whether the new task is something you want to do more than the old task
Why not just say “whether the new task is a better thing to do than the old task”? Like, there’s the rule to start your day with whatever has the biggest “ugh field”, i.e. whatever is most painful. I get that you’re using “want to do more than” broadly, but still, it carries the wrong connotation for me. :-)
Also: favorite parenting books for ages 0-2: No Bad Kids, Ferber, Anthropology of Childhood, Oh Crap, King Baby, Bringing Up Bebe
Ages 4-7ish learning resource recommendations! Needless to say, different kids will take to different things. But here’s my experience fwiw.
MATH: DragonBox - A+. A whole series of games running from basic numeracy through geometry and algebra. Excellent gameplay, well-crafted, kid loves them.
Number Blocks (Netflix) - A+. Basic numeracy, addition, concept of multiplication. Kid must have watched each episode 10 times and enthused about it endlessly.
Counting Kingdom - A+. Mastering mental addition. Excellent gameplay; fun for adults too. Note: not currently available on iPad; I got it on PC Steam.
Slice Fractions 1 & 2 - A+. Teaches fractions. Great gameplay, great pedagogy.
An old-fashioned pocket calculator - A+. An underrated toy.
LITERACY: Explode The Code book - A. Been around since at least the 1980s, still good.
Teach Your Monster To Read - B+. Gameplay is a bit repetitive & difficulty progresses too quickly, but got a few hours of great learning in there before he lost interest.
Poio - A-. Good gameplay, kid really liked it. Limited scope but great for what it is.
For reading, no individual thing seemed to make a huge difference and none of them kept his interest too long. But it all added up, bit by bit, and now he’s over the hump, reading unprompted. Yay!
PROGRAMMING: Scratch Jr - A+, duh.
I couldn’t be bothered to buy the rubber thing but I am finding that a pair of rubber bands improves both my surgical masks and my KN95s (which otherwise don’t fit me well). Thanks for the tip!
Oh, OK, I see what you mean. Possibly related: my comment here.
Update: Both the podcast and the article were interesting, enjoyable, and helpful :-)
Humans don’t have a training / deployment distinction either… Do humans have “reusable parameters”? Not quite sure what you mean by that.
Strong agree that I have lots of detailed thoughts about the neocortex’s algorithms and am probably implicitly leaning on them in ways that I’m not entirely aware of and not communicating well. I appreciate your working with me. :-)
I do want to walk back a bit about the reward prediction error stuff. I think the following is equivalent but simpler:
I propose that the subcortex sends a reward related to the time-derivative of how strongly the neocortex is imagining / expecting to taste salt. So the neocortex gets a reward for first entertaining the idea of tasting salt, and another incremental reward for growing that idea into a definite plan. But then it would get a negative reward for dropping that idea.
(I think this is maybe related to the Russell-Ng potential-based reward shaping thing.)
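For concreteness, here’s a rough sketch of that connection (hypothetical names; no discounting, so gamma = 1):

```python
GAMMA = 1.0  # ignoring discounting, as above

def phi(neocortex_state):
    # Phi = how strongly the neocortex is currently imagining / expecting
    # to taste salt (the "potential" in potential-based shaping).
    return neocortex_state["salt_expectation_strength"]

def shaping_reward(prev_state, new_state):
    # F(s, s') = GAMMA * Phi(s') - Phi(s): positive while the expectation
    # ramps up (entertaining the idea, firming it into a plan), negative
    # when the idea is dropped.
    return GAMMA * phi(new_state) - phi(prev_state)
```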
the neocortex is constantly incentivized to fool the basal ganglia into predicting higher rewards
Well, there are a couple of things, I think.
First, the neocortex can’t just expect arbitrary things. It’s constrained by self-supervised learning, which throws out models that have, in the past, made predictions refuted by experience. Like, let’s say that every time you open the door, the handle makes a click. You’re going to start expecting the click to happen. You have no choice, you can’t not expect it! There are also constraints around self-consistency and other things, like you can’t visualize something that is simultaneously stationary and dancing; those two models are just inconsistent, and the message-passing algorithm will simply not allow both to be active at the same time.
Second, I think that one neocortex “thought” is made up of a large number of different components, and all of them carry separate reward predictions, which are combined (somehow) to get the attractiveness of the overall thought. Like, when you decide to step outside, you might expect to feel cold and sore muscles and wind and you’ll say goodbye to the people inside … all those different components could have different attractiveness. And an RPE changes the reward predictions of all of the ingredients of the thought, I think.
So like, if you’re very hungry but have no food, you can say to yourself “I’m going to open my cupboard and find that food has magically appeared”, and it seems like that should be a positive-RPE thought. But actually, the thought doesn’t carry a positive reward. The “I will find food” part by itself does, but meanwhile you’re also activating the thought “I am fooling myself”, and the previous 10 times that thought was active, it carried a negative RPE, so that thought carries a very negative reward prediction whenever it’s invoked. But you can’t get rid of that thought, because it previously made correct sensory predictions in this kind of situation—that’s the previous paragraph.
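A toy version of that, assuming (just for concreteness, since I said “combined (somehow)”) that the component predictions simply add up:

```python
# Learned reward predictions attached to individual thought-components:
reward_prediction = {"I will find food": +5.0, "I am fooling myself": -8.0}

def thought_attractiveness(components):
    return sum(reward_prediction[c] for c in components)

# The hungry-but-no-food thought activates both components at once:
print(thought_attractiveness({"I will find food", "I am fooling myself"}))
# -> -3.0: net aversive, despite the appealing "I will find food" part
```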
Imagining eating beans to decide how rewarding they would be doesn’t seem to get any harder if I already know I don’t have any beans. And it doesn’t feel like “thoughts of eating beans” are reinforced, it feels like I gain abstract knowledge that eating beans would be rewarded.
I would posit that it’s a subtle effect in this particular example, because you don’t actually care that much about beans. I would say “You get a subtle positive reward for entertaining the idea of eating beans, and then if you realize that you’re out of beans and put the idea aside, you get a subtle negative reward upon going back to baseline.” I think if you come up with less subtle examples it might be easier to think about, perhaps.
My general feeling is that if you just abstractly think about something for no reason in particular, it activates the models weakly (and ditto if you hear that someone else is thinking about that thing, or remember that thing in the past, etc.) If you start to think of it as “something that will happen to me”, that activates the models more strongly. If you are directly experiencing the thing right now, it activates the model most strongly of all. I acknowledge that this is vague and unjustified, I wrote this but it’s all pretty half-baked.
An additional complication is that, as above, one thought consists of a bunch of component sub-thoughts, which all impact the reward prediction. If you imagine eating beans knowing that you’re not actually going to, the “knowing that I’m not actually going to” part of the thought can have its own reward prediction, I suppose.
Oh, yet another thing is that I think maybe we have no subjective awareness of “reward”, just RPE. (Reward does not feel rewarding!) So if we (1) decide “I will imagine yummy food”, then (2) imagine yummy food, then (3) stop imagining yummy food, we get a positive reward from the second step and a negative reward from the third step, but both of those rewards were already predicted by the first step, so there’s no RPE in either the second or third step, and therefore they don’t feel positive or negative. Unless we’re hungrier than we thought, I guess...
it seems like there’s a floor on the effect size, where arbitrarily low probability eventually stops weakening the effect
Yeah sure, if a model is active at all, it’s active above some threshold, I think. Like, if the neuron fires once every 10 minutes, then, well, the model is not actually turned on and affecting the brain. This is probably related to our inability to deal with small probabilities.
Meanwhile, it’s quite possible to trigger physiological responses by imagining things.
Yes, I would say the “neocortex is imagining / expecting to taste salt” signal has many downstream effects, one of which is affecting the reward signal, one of which is causing salivation.
This doesn’t seem like it stops working if you keep doing it
Really? I think that if some thought causes you to salivate, but doesn’t actually ever lead to eating for hours afterwards, and this happens over and over again for weeks, your systems would learn to stop salivating. I guess I don’t know for sure. Didn’t Pavlov do that experiment? See also my “scary movie” example here.
the rat starts salivating and feels something in its stomach that it previously learned means “my body wants the food” and concludes eating salt would be a good idea
Basically, there could be a non-reward signal that indicates “whatever you’re thinking of, eat it and you’ll feel rewarded”. And that could be learned from eating other food over the course of life. Yeah, sure, that could work. I think it would sorta amount to the same thing, because the neocortex would just turn that signal into a reward prediction, and register a positive RPE when it sees it. So why not just cut out the middleman and create a positive RPE by sending a reward? I guess you would argue that if it’s not at all rewarding to imagine food that you know you’re not going to eat, your theory fits that better.
Still thinking about it.
Thanks again, you’re being very helpful :-)
At the moment, AI improves rapidly simply because current algorithms yield significant improvements when increasing compute. It is often better to double the compute than work on improving the algorithm. However, compute prices will decrease less rapidly in the future. Then, AI will need better algorithms. If these can not be found as rapidly as compute helped in the past, AI will not grow on the same trajectory any more. Progress slows. Then, a second AI winter can happen.
I kinda disagree with this, especially the first sentence. “Increasing compute” is indeed one thing that is happening in AI, and it’s in the headlines a lot, but it’s not the only thing happening in AI. Algorithmic innovations are happening now and have been happening all along. Like, 3 years ago, the Transformer had just been invented … 5 years ago there was no BatchNorm or ResNets … In the area I’m most interested in (neocortex-like models), the (IMO) most promising developments are in a very early research-project-ish stage, maybe like deep neural nets were in the 1990s, probably years away from progressing to parallelized, hardware-accelerated, turn-key code that can even begin to be massively scaled.
Thanks! Haha, nothing wrong with introspection! It’s valid data, albeit sometimes misinterpreted or overgeneralized. Anyway, looking forward to your future posts!
Sure. I think even more interesting than the ratio / frequency argument is the argument that if you check whether the ice cube has coalesced, then that brings you into the system too, and now you can prove that the entropy increase from checking is, in expectation, larger than the entropy decrease from the unlikely chance that you find an ice cube. Repeat many times and the law of large numbers guarantees that this procedure increases entropy. Hence no perpetual motion. Well anyway, that’s the part I like, but I’m not disagreeing with you. :-)
Have you seen Eliezer’s 2008 post on the 2nd law? His perspective matches my own, and I was delighted to see it written up so nicely. (Eliezer might have gotten it from Jaynes? Not sure. I reinvented that wheel, for my part, it’s quite possible that Eliezer did too.) That style of argument convinces me that the 2nd law cannot depend on any empirical claims like “matter is stable at large scales”. It should just depend on conservation of phase space (Liouville’s theorem in classical mechanics, unitarity in quantum mechanics). And it depends on human brains being part of the same universe as everything else, and subject to the same laws.
The entropy of a given macrostate is the uncertainty about the microstate of an observer who knows only the macrostate. In general, you have more information than this.
I basically agree. There’s a nice description here of getting a subregion of phase space that looks like ever-finer filamentary threads spread all around, or as I put it “it often turns out that you wind up with useless information about the microstate, i.e. information that cannot be translated into a “magic recipe” for undoing the apparent disorder … In that case, you might as well just forget that information and accept a higher entropy.”
There are cases where it’s not obvious what the macrostate is, and you need to think more concretely about what exactly you can and can’t do with the microstate information. My example here was a light beam whose polarization is set by a pseudorandom code that changes every nanosecond. Another example would be shining light through a ground-glass diffuser: It looks like a random, high-entropy, diffuse beam … But if you have an exact copy of the original diffuser, and a phase-conjugate mirror, you can unwind the randomness and magically get the original low-entropy beam back.
You’re right. “Thinking about salt is rewarded anyway” doesn’t make sense and isn’t right. You’re one of two people to call me out on it, and I just posted a long comment replying to the other here. Thank you!! I just added a correction to the article:
(UPDATE: Commenters point out that this description isn’t quite right—it doesn’t make sense to say that the idea of tasting salt is rewarding per se. Rather, when the rat starts expecting to taste salt, the subcortex sends a positive reward-prediction-error signal, and conversely if the rat stops expecting to taste salt, the subcortex sends a negative reward-prediction-error signal. Something like that. Sorry for the mistake / confusion. Thanks commenters!)
Does that answer your question?
I don’t generally feel any desire to spend time fantasizing about the food I’m waiting for.
Haha, yeah, there’s a song about that.
So anyway, I think you’re onto something, and I think that something is that “reward” and “reward prediction” are two distinct concepts, but they’re all jumbled up in my mind, and therefore presumably also jumbled up in my writings. I’ve been vaguely aware of this for a while, but thanks for calling me out on it, I should clean up my act. So I’m thinking out loud here, bear with me, and I’m happy for any help. :-)
The TD learning algorithm is:
RPE = Reward Prediction Error = (r + V(s_new)) − V(s_prev)
where s_prev is the previous state, s_new is the new state, V is the value function a.k.a. reward prediction, and r is the reward from this step.
(I’m ignoring discounting. BTW, I don’t think the brain literally does TD learning in the exact form that computer scientists do, but I think it’s close enough to get the right idea.)
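In code, the whole thing is just a couple of lines (a tabular sketch, with alpha as a learning rate):

```python
from collections import defaultdict

V = defaultdict(float)  # reward predictions (value), one entry per state

def td_step(s_prev, s_new, r, alpha=0.1):
    rpe = (r + V[s_new]) - V[s_prev]  # the formula above, no discounting
    V[s_prev] += alpha * rpe          # learning: adjust the prediction
    return rpe
```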
So let’s go through two scenarios.
Scenario A: I’m going to eat candy, anticipating a large reward (high V(s_prev)). I eat the candy (high r), then don’t anticipate any reward after that (low V(s_new)). RPE = 0 here. That’s just what I expected. V went down, but it went down in lock-step with the arrival of the reward r.
Scenario B: I’m going to eat candy, anticipating a large reward (high V(s_prev)). Then I see that we’re out of candy! So I get no reward and have nothing to look forward to (r = 0, low V(s_new)). Now this is a negative (bad) RPE! Subjectively, this feels like crushing disappointment. The TD learning rule kicks in here, so that next time when I go to eat candy, I won’t be expecting as much reward as I did this time (lower V(s_prev) than before), because I will be preemptively braced for the possibility that we’ll be out of candy.
OK, makes sense so far.
Interestingly, the reward r, as such, barely matters here! It’s not decision-relevant, right? Good actions can be determined entirely by the following rule:
Each step, do whatever maximizes RPE.
Or subjectively, thoughts and sensory inputs with positive RPE are attractive, while thoughts and sensory inputs with negative RPE are aversive.
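As a toy sketch (reusing the V table idea from above; expected_r is a hypothetical per-candidate anticipated reward):

```python
def pick_next(V, s_prev, candidates, expected_r):
    # Score each candidate next state/thought by the RPE it would produce;
    # positive RPE = attractive, negative RPE = aversive.
    return max(candidates, key=lambda s: (expected_r[s] + V[s]) - V[s_prev])
```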
OK, so when the rat first considers the possibility that it’s going to eat salt, it gets a big injection of positive RPE. It now (implicitly) expects a large upcoming reward. Let’s say for the sake of argument that it decides to not eat the salt, and go do something else. Well now we’re not expecting to eat the salt, whereas previously we were, so that’s a big injection of negative RPE. So basically, once it gets the idea that it can eat salt, it’s very aversive (negative RPE) to drop that idea, without actually consummating it (by eating the salt and getting the anticipated reward r).
Back to your food example, you go in with some baseline expectation for what dinner’s going to be like. Then you invoke the idea “I’m going to eat yam”. You get a negative RPE in response. OK, go back to the baseline plan then. You get a compensatory positive RPE. Then you invoke the idea “I’m going to eat beans”. You get a positive RPE. Alright! You think about it some more. Oh, I can’t have beans tonight, I don’t have any. You drop the idea and suffer a negative RPE. That’s aversive, but you’re stuck. Then you invoke another idea: “I’m going to eat porridge”. Positive RPE! As you flesh out the plan, you become more confident in it, which activates the model more strongly; the idea in your head of having porridge becomes more vivid, so to speak. Each increment of increasing confidence that you’re going to eat porridge is rewarded by a corresponding spurt of positive RPE. Then you eat the porridge. The reward prediction drops back to baseline, but the reward r arrives at the same time, so that’s fine, there’s no RPE.
Let’s go to fantasizing in general. Let’s say you get the idea that a wad of cash has magically appeared in your wallet. That idea is attractive (positive RPE). But sooner or later you’re going to actually look in the wallet and find that there’s no wad of cash (negative RPE). The negative RPE triggers the TD learning rule such that next time “the idea that a wad of cash has magically appeared in your wallet” will not be such an attractive idea, it will be tinged with a negative memory of it failing to happen. Of course, you could go the other way and try to avoid the negative RPE by clinging to the original story—like, don’t look in your wallet, or if you see that the cash isn’t there you think “guess I must have deposited in the bank already”, etc. This is unhealthy but certainly a known human foible. For example, as of this writing, in the USA, each of the two major presidential candidates has millions of followers who believe that their preferred candidate will be president for the next four years. It’s painful to let go of an idea that something good is going to happen, so you resist if at all possible. Luckily the brain has some defense systems against wishful thinking. For example you can’t not expect something to happen that you’ve directly experienced multiple times. See here. Another is: if you do eventually come back to earth, and the negative RPE finally does happen, then TD learning kicks in, and all the ideas and strategies that contributed to your resisting the truth until now get tarred with a reduction in associated RPE, which makes them less likely to be used next time.
Hmm, so maybe I had it right in the diagram here: I had the neocortex sending reward predictions to the subcortex, and the subcortex sending back RPEs to the neocortex. So if the neocortex sends a high reward prediction, then a low reward prediction, that might or might not be an RPE, depending on whether you just ate candy in between. Here, the subcortex sends a positive RPE when the neocortex starts imagining tasting salt, and sends a negative RPE when it stops imagining salt (unless it actually ate the salt at that moment). And if the salt imagination / expectation signal gets suddenly stronger, it sends a positive RPE for the difference, and so on.
(I could make a better diagram by pulling a “basal ganglia” box out of the neocortex subsystem into a separate box in the diagram. My understanding, definitely oversimplified, is that the basal ganglia has a dense web of connections across the (frontal lobe of the) neocortex, and just memorizes reward predictions associated with different arbitrary neocortical patterns. And it also suppresses patterns that lead to lower reward predictions and amplifies patterns that lead to higher reward predictions. So in the diagram, the neocortex would send “information” to the basal ganglia, the basal ganglia would calculate a reward prediction and send it to the subcortex, and the subcortex would send the RPE to the basal ganglia (to alter the reward predictions) and to the neocortex (to reinforce or weaken the associated patterns). Something like that...)
Does that make sense? Sorry this is so long. Happy for any thoughts if you’ve read this far.
Another update: Actually maybe it’s simpler (and equivalent) to say the subcortex gives a reward proportional to the time-derivative of how strongly the salt-expectation signal is activated.
But don’t be too quick to write off Factored Cognition entirely based on that. The fact that it’s a problem doesn’t mean it’s unsolvable.
I agree. I’m always inclined to say something like “I’m a bit skeptical about factored cognition, but I guess maybe it could work, who knows, couldn’t hurt to try”, but then I remember that I don’t need to say that, because practically everyone else thinks that too, even its most enthusiastic advocates, as far as I can tell from my very light and casual familiarity with it.
I haven’t read any of the posts you’ve linked
Hmm, maybe if you were going to read just one of mine on this particular topic, it should be Can You Get AGI From A Transformer, instead of the one I linked above. Meh, either way.
No way! Awesome, looking forward to listening to that! :-)
I find it helpful to think about our brain’s understanding as lots of subroutines running in parallel. They mostly just sit around doing nothing. But sometimes they recognize a scenario for which they have something to say, and then they jump in and say it. So in chess, there’s a subroutine that says “If the board position has such-and-such characteristics, it’s worthwhile to consider protecting the queen.” There’s a subroutine that says “If the board position has such-and-such characteristics, it’s worthwhile to consider moving the pawn.” And of course, once you consider moving the pawn, that brings to mind a different board position, and then new subroutines will recognize it, jump in, and have their say.
So if you take an imperfect rule, like “Python code runs the same on Windows and Mac”, the reason we can get by using this rule is because we have a whole ecosystem of subroutines on the lookout for exceptions to the rule. There’s the main subroutine that says “Python code runs the same on Windows and Mac.” But there’s another subroutine that says “If you’re sharing code between Windows and Mac, and there’s a file path variable, then it’s important to follow such-and-such best practices”. And yet another subroutine is sitting around looking for UI code, ready to interject that that can also be a cross-platform incompatibility. And yet another subroutine is watching for you to call a system library, etc. etc.
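If it helps, here’s a cartoon of that ecosystem in code (everything hypothetical, of course): a pile of (trigger, advice) pairs, each silent until its cue fires.

```python
SUBROUTINES = [
    (lambda ctx: ctx["sharing_code_across_os"],
     "Python code runs the same on Windows and Mac."),
    (lambda ctx: ctx["sharing_code_across_os"] and ctx["uses_file_paths"],
     "...but follow best practices for file paths (os.path / pathlib)."),
    (lambda ctx: ctx["sharing_code_across_os"] and ctx["uses_ui_code"],
     "...and UI code can also be a cross-platform incompatibility."),
]

def consult(ctx):
    # Every subroutine watches in parallel; only the ones whose cue
    # matches the current situation jump in and have their say.
    return [advice for trigger, advice in SUBROUTINES if trigger(ctx)]

print(consult({"sharing_code_across_os": True,
               "uses_file_paths": True,
               "uses_ui_code": False}))
```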
So, imagine you’re working on a team, and you go to a team meeting. You sit around for a while, not saying anything. But then someone suggests an idea that you happened to have tried last week, which turned out not to work. Of course, you would immediately jump in to share your knowledge with the rest of the meeting participants. Then you go back to sitting quietly and listening.
I think your whole understanding of the world and yourself and everything is a lot like that. There are countless millions of little subroutines, watching for certain cues, and ready to jump in and have their say when appropriate. (Kaj calls these things “subagents”, I more typically call them “generative models”, Kurzweil calls them “patterns”, Minsky calls this idea “society of mind”, etc.)
Factored cognition doesn’t work this way (and that’s why I’m cautiously pessimistic about it). Factored cognition is like you show up at the meeting, present a report, and then leave. If you would have had something important to say later on, too bad, you’ve already left the room. I’m skeptical that you can get very far in figuring things out if you’re operating under that constraint.
Well, you are clearly an expert here.
LOL, “fake it til you make it” ftw! I disagree, but that’s very kind of you to say. :-)
assuming you needed to code, say, an NPC in a game, you would code an “urge” a certain way, probably in just a few dozen lines of code.
Hmm, if the NPC didn’t need to be very good, I would use bacteria-level logic like “if you’re getting attacked, move away from the last hit, if you can attack, attack, otherwise move towards the player” or whatever. Then an “urge to be more aggressive” would be, like, move towards the player more quickly, or changing some thresholds. But there’s no foresight / planning here. So that’s not exactly relevant to this post. The rats are making a plan to get saltwater.
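Literally something like this (toy 1-D world, hypothetical state representation); note that it’s pure stimulus-response, with no foresight:

```python
def npc_act(npc, player):
    # Bacteria-level logic: react to whatever is right in front of you.
    # An "urge to be more aggressive" here would just mean tweaking a
    # threshold like attack_range, or the movement speed.
    if npc["recently_hit"]:
        return ("move", -npc["last_hit_direction"])   # flee the last hit
    if abs(player["pos"] - npc["pos"]) <= npc["attack_range"]:
        return ("attack", None)
    return ("move", 1 if player["pos"] > npc["pos"] else -1)  # close in
```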
So, if I were going to make the NPC better, maybe I would next incorporate planning. Like: Define an “expected reward” function (related to position, health, etc.), consider possible things to do next, and then pick the thing with the highest expected reward at the end. That might be more than a few dozen lines of code … I guess it depends on the library functions available :-P Then you could have an “urge” be expressed through tweaking the parameters of the expected-reward calculation. And now the NPC would then be able to make plans to satisfy the urge, as opposed to just acting based on what’s right in front of it, or whatever.
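Sketch of that version (again all hypothetical): the “urge” is now a weight inside the expected-reward calculation, so it can shape plans rather than just reflexes.

```python
def expected_reward(outcome, urges):
    # outcome: predicted result of an action; urges: tunable weights.
    return (urges["aggression"] * outcome["damage_dealt"]
            - outcome["damage_taken"]
            + outcome["health_recovered"])

def plan(candidate_actions, simulate, urges):
    # One-step lookahead: simulate each candidate action and pick the one
    # whose predicted outcome scores highest under the current urges.
    return max(candidate_actions,
               key=lambda a: expected_reward(simulate(a), urges))
```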
This thing where it considers multiple possible courses of action—that’s not yet model-based reinforcement learning. There’s no learning! But I would say it’s a first step. There is indeed something like that as an ingredient in AlphaZero, for example.
But that’s actually all that matters for this post anyway, I think. The real RL part—where you learn things—didn’t come up. Maybe I shouldn’t have brought up RL at all, now that I think about it :-P