Someone asked for this file, so I thought it would be interesting to share it publicly. Notably this is directly taken from my internal notes, and so may have some weird &/or (very) wrong things in it, and some parts may not be understandable. Feel free to ask for clarification where needed.
I want a way to take an agent, and figure out what its values are. For this, we need to define abstract structures within the agent such that any values-like stuff in any part of the agent ends up being shunted off to a particular structure in our overall agent schematic after a number of gradient steps.
Given an agent which has been optimized for a particular objective in a particular environment, there will be convergent bottlenecks in that environment which the agent will need to solve in order to make progress. One of these is power-seeking, but others could call for quadratic-equation solvers, or something like linear-program solvers. These structures will be reward-function-independent[1]. These structures will be recursive, and we should expect them to be made out of even-more-convergent structures.
How do shards pop out of this? In the course of optimizing our agent, some of our solvers may have a bias towards leading our agent into situations which require their use. We may also see this kind of behavior in groups of solvers, where solver_1() leads the agent into situations requiring solver_2(), which leads the agent into situations requiring solver_1(). In the course of optimizing our agent (at least at first), we will be more likely to find these kinds of solvers. A solver which often leads the agent into situations requiring solvers the agent does not yet have has no immediate gradient pointing towards it, since if the agent tried to use that solver, it would just end up confused once it entered the new situation. So we are left selecting only for solvers which mostly lead the agent into situations it already knows how to deal with.
Why we need to enforce exploration behavior: otherwise solver-loops will be far too short & simple to do anything complicated with. Solvers will be simple because not much time has passed, and simple solvers which enter states requiring previous simple solvers will be heavily over-selected. Randomizing actions decreases this selection effect, because the agent's actions are less correlated with which solver was active.
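The selection effect above can be sketched as a toy simulation. Everything here is a made-up illustration, not anything from the notes: the solver names, the deterministic "leads-to" map, and the reinforce-if-the-next-state-is-solvable rule are all hypothetical assumptions.

```python
import random

# Toy model: each "solver", when it fires, steers the agent into a state
# requiring some other solver. A solver gets reinforced when the agent can
# already handle the state it steered into. s1 and s2 form a solver-cycle;
# s3 leads into states requiring a solver the agent does not yet have.
KNOWN = {"s1", "s2"}            # solvers the agent already has
LEADS_TO = {                    # which solver the resulting state requires
    "s1": "s2",
    "s2": "s1",
    "s3": "s4",                 # s4 is not yet learned
}

def run(epsilon, steps=10_000, seed=0):
    """Return reinforcement counts per solver under exploration rate epsilon."""
    rng = random.Random(seed)
    credit = {s: 0 for s in LEADS_TO}
    for _ in range(steps):
        solver = rng.choice(list(LEADS_TO))       # some solver fires
        if rng.random() < epsilon:                # exploration: next state is
            needed = rng.choice(list(LEADS_TO))   # decorrelated from the solver
        else:
            needed = LEADS_TO[solver]
        if needed in KNOWN:                       # agent can handle next state,
            credit[solver] += 1                   # so the solver is reinforced
    return credit

greedy = run(epsilon=0.0)
explore = run(epsilon=0.9)
# With no exploration, the cycle {s1, s2} hoovers up all the reinforcement and
# s3 gets none; heavy exploration shrinks the gap between cycle members and s3.
```

The point of the sketch is the comparison between the two runs: randomized actions weaken the correlation between "which solver was active" and "which state came next", which is exactly the correlation the solver-cycle selection effect feeds on.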
Solvers which are very convergent need not enter into solver-cycles, since every solver-cycle will end up using them.
Naively, good news against power-seeking?
What happens if we call these solver-cycles shards?
baby-candy example in this frame: Baby starts with zero solvers, just a bunch of random noise. After it reaches over and puts candy in its mouth many times, it gets the identify_candy_and_coordinate_hand_movements_to_put_in_mouth() solver[2]. Very specific, but with pieces of usefulness. The sections vaguely devoted to identifying objects (like implicit edge-detectors) will get repurposed for general vision processing, and the sections devoted to coordinating hand movements will also get repurposed for many different goals. The candy bit and the put-in-mouth bit only survive if they can lead the agent to worlds which reinforce candy-abstractions and putting-things-in-mouth abstractions. Other parts don't really need to try.
Brains aren’t modular! So why expect solvers to be?
I like this line of speculation, although it feels subtly off to me.
This seems like it would mean I care about moving my arms more than I care about candy, because I use my arms for so many things. However, I don't feel like I care more about moving my arms than eating candy.
Though maybe part of this is that candy makes me feel bad when I eat it. What about walking in parks or looking at beautiful sunsets? I definitely care about those more than moving my arms, I think? And I don't gain intrinsic value from moving my arms, only power-value, I think?
Power is a weird thing: it's highly convergent, but it also doesn't seem that hard for such a solver to put a bit of optimization power towards "also, reinforce the power-seeking solver" and end up successful.
Well… it’s unclear what their values would be.
Maybe it'd effectively be probability-of-being-activated-again?
It wouldn’t be, but I do think there’s something to ‘discrete values-like objects lie in solver-cycles’.
Perhaps we can watch this happen via some kind of Markov-chain-like thing?
Put the agent into a situation, look at what its activation patterns look like, allow it to be in a new situation, look at the activation patterns again, etc.
Suppose each activation is a unique solver, and the ground truth looks like so: the dots labeled 1, 2, and 3 are the solver-activations, where 1 will try to get 2 activated, 2 will try to get 1 active, and 3 will try to get itself active[3]. If 1 is active, we expect the activation on 2 to be positive, and on 3 to be negative or zero.
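The watch-it-as-a-Markov-chain idea can be sketched concretely. This is a hypothetical toy: the three-solver ground truth is just the 1→2, 2→1, 3→3 picture above, and `noise` stands in for imperfect influence / exploration.

```python
import random
from collections import Counter

# Ground truth from the diagram: 1 tries to activate 2, 2 tries to activate 1,
# and 3 tries to keep itself active.
GROUND_TRUTH = {1: 2, 2: 1, 3: 3}

def observe_chain(steps=5000, noise=0.1, seed=0):
    """Record (active solver, next active solver) pairs across situations."""
    rng = random.Random(seed)
    counts = Counter()
    state = 1
    for _ in range(steps):
        # With probability `noise` the next activation is random, modeling
        # randomized actions decorrelating the trajectory from the solver.
        nxt = GROUND_TRUTH[state] if rng.random() > noise else rng.choice([1, 2, 3])
        counts[(state, nxt)] += 1
        state = nxt
    return counts

def transition_matrix(counts):
    """Empirical P[i, j] = fraction of transitions out of i that land on j."""
    solvers = [1, 2, 3]
    totals = {i: sum(counts[(i, j)] for j in solvers) for i in solvers}
    return {(i, j): counts[(i, j)] / totals[i]
            for i in solvers for j in solvers if totals[i]}

P = transition_matrix(observe_chain())
# The solver-cycle {1, 2} shows up as large P[1, 2] and P[2, 1];
# 3's self-reinforcement shows up as a large P[3, 3].
```

Solver-cycles would then be readable off the estimated transition matrix as communicating classes with heavy within-class mass, without needing access to the ground truth.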
As per the end of footnote [3], I think the correct way to operationalize "active" here is something to do with whether or not that particular solver is reinforced or disinforced after the gradient update.

What are these solvers actually in the network? Lucius modules, maybe?
There will be some shards which we probably can’t avoid. But also, if we have a good understanding of the convergent problems in an environment, we should be able to predict what the first few solvers are, and solvers after those should mostly build upon the previous solvers’ loop-coalitions?
Re: Agray's example about motor movements in the brain, and how you'll likely see a jumbled mess of lots of stuff causing lots of other stuff to happen, even though movement is highly instrumentally valuable:
I think even if he's right, many of the arguments here still hold. Each section of processing still needs to be paying rent to stay in the agent, either by supporting other sections, getting reward, or steering the agent away from situations which would decrease its usefulness.
So though it may not make sense to think of APIs between different sections, it may still be useful to frame the picture that way and then imagine how the APIs get obliterated by SGD; or maybe we can formulate things without the use of APIs.
Though we do see things get lower dimensional, and so if John’s right, there should be some framing by which in fact what’s going on passes through constraint functions...
Not including “weird” utility functions. I’m talking about most utility functions. Perhaps we can formalize this in a way similar to TurnTrout’s formalization in powerseeking if we really needed to.↩︎
Note that this is all going to be a blended mess of spaghetti-coded mush, which does everything at the same time, with some parts which are vaguely closer to edge-detection, and other parts which vaguely look like motor control. This function is very much not going to be modular, and if you want to say APIs between different parts of the function exist, they're going to look like very high-dimensional ones.↩︎
Where the magnitude of a particular activation can be defined as something like the absolute value of the gradient of the final decision with respect to that activation, or mag(a) = |∇ₐ f(p, r; θ)|, where a is the variable representing the activation, f is the function representing our network, p is the percept our network gets about the state, r is its recurrency, and θ are the parameters of our network. We may also want to define this in terms of collections of weights too, perhaps having to do with Lucius's features stuff. Don't get tied to this. Possibly we want just the partial derivative of the action actually taken with respect to a, or really, the partial of the highest-valued output action. And I want a way to talk about dis-enforcing stuff too. Maybe we just re-run the network on this input after taking a gradient step, then see whether a has gone up or down. That seems safer.↩︎
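A minimal numerical sketch of both proposals in this footnote, on a made-up two-parameter "network" f(p; θ) = w2 · tanh(w1 · p) with a single intermediate activation a = tanh(w1 · p). The function, parameters, and learning rate are all illustrative assumptions.

```python
import math

def forward(p, w1, w2):
    """Tiny stand-in network; returns (activation a, output f)."""
    a = math.tanh(w1 * p)
    return a, w2 * a

def mag(p, w1, w2, eps=1e-6):
    """mag(a) = |df/da|, estimated by nudging the activation directly.
    For this network it is analytically just |w2|."""
    a, _ = forward(p, w1, w2)
    return abs((w2 * (a + eps) - w2 * a) / eps)

def reinforced(p, w1, w2, lr=0.01, eps=1e-6):
    """The 'safer' check: take one gradient step that increases the output
    (treating the output as the reinforced quantity), re-run the network on
    the same input, and report whether the activation a went up."""
    a_before, _ = forward(p, w1, w2)
    # finite-difference gradients of the output w.r.t. the parameters
    g1 = (forward(p, w1 + eps, w2)[1] - forward(p, w1, w2)[1]) / eps
    g2 = (forward(p, w1, w2 + eps)[1] - forward(p, w1, w2)[1]) / eps
    a_after, _ = forward(p, w1 + lr * g1, w2 + lr * g2)
    return a_after > a_before

print(mag(p=1.0, w1=0.5, w2=2.0))         # ~2.0, i.e. |w2|
print(reinforced(p=1.0, w1=0.5, w2=2.0))  # True: the step raised a
```

The second function is the footnote's rerun-after-a-gradient-step test: it sidesteps choosing between "gradient of the final decision" and "partial of the action actually taken" by just checking the sign of the activation's change.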