Today's hot takes (or something)
There is nothing special about human-level intelligence, unless you have imitation learning, in which case human-level capabilities are very special.
General intelligence is not very efficient. Therefore there will not be any selection pressure for general intelligence as long as other options are available.
The no free lunch theorem only says that you can’t learn to predict noise.
GI is very efficient, if you consider that you can reuse a lot of the machinery that you learn, rather than needing to relearn it over and over again. https://towardsdatascience.com/what-is-better-one-general-model-or-many-specialized-models-9500d9f8751d
Second reply. And this time I actually read the link.
I'm not surprised by that result.
My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is.
But to use an analogy, it's something like this: In the example you gave, the AI gets better at the subtasks by learning on a more general training set. It seems like general capabilities were useful. But suppose we instead trained on even more data for a single subtask; wouldn't the model develop general capabilities then, since we just noticed that general capabilities were useful for that subtask? I was planning to say "no", but I notice that I do expect some transfer learning. I.e. if you train on just one of the datasets, I expect the model to be bad at the other ones, but I also expect it to learn them quicker than without any pre-training.
I seem to expect that AI will develop general capabilities when trained on rich enough data, i.e. almost any real-world data. LLMs are a central example of this.
I think my disagreement with at least myself from some years ago, and probably some other people too (but I've been away from the discourse for a bit, so I'm not sure), is that I don't expect as much agentic long-term planning as I used to.
I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient, in terms of how much machinery you need, but it will never be able to compete on speed.
Also, it's an open question what counts as "enough different types of tasks". Obviously, for a sufficiently broad class of problems GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficiently narrow class of problems narrow capabilities will be more efficient.
Humans have GI to some extent, but we mostly don't use it. This is interesting. It means that a typical human environment is complex enough that it's worth carrying around the hardware for GI. But even though we have it, it is evolutionarily better to fall back on habits, imitation, or instinct in most situations.
Looking back at exactly what I wrote: I said there will not be any selection pressure for GI as long as other options are available. I'm not super confident in this. But I'm going to defend it here anyway by pointing out that "as long as other options are available" is doing a lot of the work. Some problems are only solvable by noticing deep patterns in reality, and in that case a sufficiently deep NN with sufficient training will learn them, and that is GI.
I like that description of NFL!
Re: your hot take on general intelligence, see: “Is General Intelligence Compact?”
Decision transformers ≈ Quantilizers
You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?
Similar but not exactly.
I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions, you sample from a shifted or truncated distribution so as to favour higher-reward policies.
In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI predict the future reward itself, but with the distribution shifted towards higher rewards.
I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison.
With quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on the probability of DOOM for the quantilizer. This is not the case for the type of probability shift used by the linked decision transformer.
DOOM = an unforeseen catastrophic outcome, one that would not be labelled as very bad by the AI's reward function but is in reality VERY BAD.
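To make the comparison concrete, here is a minimal numeric sketch with made-up numbers. The quantilizer samples uniformly from the top-q fraction of a base distribution, so the probability of any event can be boosted by at most a factor of 1/q. For the reward shift I use an exponential tilt as a schematic stand-in for "shift the distribution towards higher rewards" (the linked decision transformer differs in the details); no comparable bound applies to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 candidate actions, each with a proxy reward and a hidden
# "doom" flag that the reward function does not see.
n = 1000
reward = rng.normal(size=n)
doom = rng.random(n) < 0.01      # base rate of catastrophe: about 1%
reward[doom] += 3.0              # doom actions happen to score well on the proxy

base = np.full(n, 1.0 / n)       # base distribution: uniform over actions

# Quantilizer: sample uniformly from the top-q fraction of the base
# distribution, ranked by proxy reward. Each action's probability is boosted
# by at most 1/q, so P(doom) <= P_base(doom) / q.
q = 0.1
top = np.argsort(reward)[-int(q * n):]
quant = np.zeros(n)
quant[top] = 1.0 / len(top)

# Reward-shifted sampling (schematic): reweight the whole base distribution
# towards higher rewards. No 1/q-style bound applies.
beta = 3.0
shifted = base * np.exp(beta * reward)
shifted /= shifted.sum()

print("P(doom), base:       ", base[doom].sum())
print("P(doom), quantilizer:", quant[doom].sum(), "<= bound:", base[doom].sum() / q)
print("P(doom), shifted:    ", shifted[doom].sum())
```

In this made-up example the shifted distribution puts most of its mass on the doom actions, precisely because they score well on the proxy reward, while the quantilizer stays under its bound.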
From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with an LLM) could be designed such that we could simply tell it to "make me tea of this quantity and quality within this time and with this probability", and it would attempt to do just that, without trying to make more or better tea, or faster, or with higher probability.
Yes, that is a thing you can do with decision transformers too. I was referring to a variant of the decision transformer (see the link in the original shortform) where the AI samples the reward it's aiming for.
Blogposts are the result of noticing differences in beliefs.
Either between you and others, or between you and you, across time.
I have lots of ideas that I don't communicate. Sometimes I read a blogpost and think "yeah, I knew that, why didn't I write this?". And the answer is that I did not have an imagined audience.
My blogposts almost always spawn after I have explained a thing ~3 times in meatspace. Generalizing from these conversations, I form an imagined audience which is some combination of the ~3 people I talked to. And then I can write.
(In a conversation I don’t need to imagine an audience, I can just probe the person in front of me and try different explanations until it works. When writing a blogpost, I don’t have this option. I have to imagine the audience.)
Another way to form an imagined audience is to write for your past self. I've noticed that a lot of things I read are like this. When you have just learned or realized something, and the past you who did not know the thing is still fresh in your memory, it is also easier to write the thing. This shortform is of that type.
I wonder if I'm unusually bad at remembering the thoughts and beliefs of past me? My experience is that I pretty quickly forget what it was like not to know a thing. But I see others writing things aimed at their past selves from years ago.
I think I'm writing this shortform as a message to my future self, for when I have forgotten this insight. I want my future self to remember this idea of how blogposts spawn. I think it will help guide her when writing posts, but also help her not to be annoyed when someone else writes a popular thing that I already knew, and I ask "why did I not write this?" There is an answer to the question "why did I not write this?", and the answer is "because I did not know how to write it".
A blogpost is a bridge between a land of not knowing and a land of knowing. Knowing the destination of the bridge is not enough to build the bridge. You also have to know the starting point.
LM memetics:
LM = language model (e.g. GPT-3)
If LMs read each other's text, we can get LM memetics. An LM meme is a pattern which, if it exists in the training data, the LM will output at a higher frequency than it occurs in the training data. If the meme is strong enough and LMs are trained on enough text from other LMs, the prevalence of the meme can grow exponentially. This has not happened yet.
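To make the exponential-growth claim concrete, here is a toy simulation with made-up numbers, assuming each model generation is trained on a mix of human text and text written by the previous generation, and that a model outputs the meme at some fixed multiple of the frequency at which it saw it.

```python
# Toy model of LM-meme prevalence across training generations.
# All parameters are illustrative assumptions, not measurements.

def meme_prevalence(generations: int,
                    human_rate: float = 1e-6,    # meme rate in human-written text
                    amplification: float = 5.0,  # output rate / training rate
                    lm_fraction: float = 0.5):   # share of training data written by LMs
    prevalence = [human_rate]
    for _ in range(generations):
        p = prevalence[-1]
        # Next generation's training data mixes static human text with
        # LM-generated text, in which the meme is amplified.
        p_next = (1 - lm_fraction) * human_rate + lm_fraction * min(1.0, amplification * p)
        prevalence.append(p_next)
    return prevalence

# Grows roughly like (amplification * lm_fraction)^t while that product
# exceeds 1; otherwise it settles at a fixed point instead of exploding.
print(meme_prevalence(10))
```

The point of the sketch is just that the meme only takes off when the amplification factor times the LM-written share of the training data exceeds 1.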
There can also be memes that have a more complicated life cycle, involving both humans and LMs. If an LM outputs a pattern that humans are extra interested in, then humans will multiply that pattern by quoting it in their blogposts, which some other LM will read, which will make the pattern more prevalent in the output of that transformer, possibly.
Generative model memetics:
The same thing can happen for any model trained to imitate its training distribution.
This mechanism may not require LMs to be involved.
Not sure what you mean exactly. But yes, memetics without AI does exist.
https://en.wikipedia.org/wiki/Memetics
I think an LM meme is something more than just a frequently repeating pattern. It is more like a frequently repeating pattern with which LMs can infect each other, by outputting it onto the web or wherever else it can be included in a training set for LMs.
There may be other features that are pretty central to the prototype of the (human) meme concept, such as its usefulness for some purpose (of course, not all memes are useful). Maybe this one can be extrapolated to the LM domain, e.g. the pattern helps the LM predict the next token or whatever, but I'm not sure whether it's the right move to appropriate the concept of meme for LMs. If we start discovering infectious patterns of this kind, it may be better to think of them as one more subcategory of a general category of replicators, of which memes, genes, and prions are other examples.
I’m basically ready to announce the next Technical AI Safety Unconference (TAISU). But I have hit a bit of decision paralysis as to what dates it should be.
If you are reasonably interested in attending, please help me by filling in this doodle
If you don’t know what this is about, have a look at the information for the last one.
The venue will be EA Hotel in Blackpool UK again.
This is probably too obvious to write, but I’m going to say it anyway. It’s my short form, and approximately no-one reads short forms. Or so I’m told.
Human value formation is to a large extent steered by other humans suggesting value systems to you. You get some hard-to-interpret reward signal from your brainstem, or something. There are lots of "hypotheses" for the "correct reward function" you should learn.
(Quotation marks because there is no ground truth for what values you should have. But this is mathematically equivalent to learning the true statistics generating the data from a finite number of data points. Also, there is maybe some ground truth about what the brainstem rewards, or maybe not. According to Steve, there is a loop where, when the brainstem doesn't know whether things are good or not, it just mirrors the cortex's own opinion back to the cortex.)
To locate the hypothesis, you listen to other humans. I make this claim not just for moral values, but for personal preferences too. Maybe someone suggests to you that "candy is tasty", and since this seems to fit with your observations, now you also like candy. This is a bad example, since for taste specifically the brainstem has pretty clear opinions. Except there is acquired taste… so maybe not a terrible example.
Another example: You join a hobby. You notice you like being at the hobby place doing the hobby thing. Your hobby friend says (i.e. offers the hypothesis) "this hobby is great". This seems to fit your data, so now you believe you like the hobby. And because you believe you like the hobby, you end up actually liking the hobby, through a self-reinforcing loop. Although this doesn't always work. Maybe after some time your friends quit the hobby and this makes it less fun, and you realise (change your hypothesis) that you mainly liked the hobby for the people.
Maybe there is a ground truth about what we want for ourselves? I.e. we can end up with wrong beliefs about what we want due to peer pressure, commercials, etc. But with enough observation we will notice what it is we actually want.
Clearly humans are not 100% malleable, but it also seems like even our personal preferences are path-dependent (i.e. they pick up lasting influences from our environment). So maybe some annoying mix...
Somebody is reading shortforms...
I disagree that humans learn values primarily via teaching. 1) Parenting is known to have little effect on children's character, which is one way of saying their values. 2) While children learn to follow rules, teens are good at figuring out what is in their interest.
I think it makes sense to argue the point, though.
For example, I think that proposing rules makes it more probable that the brain converges on those solutions.
"1) Parenting is known to have little effect on children's character"

This is not counter-evidence to my claim. The value framework a child learns about from their parents is just one of many value frameworks they hear about from many, many people. My claim is that the power lies in noticing the hypothesis at all. Which ideas you get told more times (e.g. by your parents) doesn't matter.
As far as I know, what culture you are in very much influences your values, which my claim would predict.
"2) While children learn to follow rules, teens are good at figuring out what is in their interest."

I'm not making any claims about rule following.
What is alignment? (operationalisation)
Toy model: Each agent has a utility function it wants to maximise. The input to the utility function is a list of values describing the state of the world. Different agents can have different input vectors. Assume that every utility function monotonically increases, decreases, or stays constant for changes in each input variable (I did say it was a toy model!). An agent is said to value something if its utility function increases with increasing quantity of that thing. Note that if an agent's utility function decreases with increasing quantity of a thing, then the agent values the negative of that thing.
In this toy model agent A is aligned with agent B if and only if A values everything B values.
Q: How well does this operationalisation match my intuitive understanding of alignment?
A: Good but not perfect.
This definition of alignment is transitive, but not symmetric. This matches the properties I think a definition of alignment should have.
How about if A values a lot of things that B doesn't care about, and cares only very little about the things B cares about? That would count as aligned under this operationalisation, but it does not necessarily match my intuitive understanding of alignment.
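A minimal sketch of the first operationalisation in code, under the toy model above. Each agent is summarised by the sign of its utility's dependence on each world variable; the variable names and agents below are made up for illustration.

```python
from typing import Dict

Signs = Dict[str, int]  # variable -> +1 (values it), -1 (values its negative), 0 (indifferent)

def aligned(a: Signs, b: Signs) -> bool:
    """A is aligned with B iff A values everything B values,
    i.e. A has the same sign as B on every variable B cares about."""
    return all(a.get(var, 0) == sign for var, sign in b.items() if sign != 0)

human   = {"tea_quality": +1, "noise": -1}
servant = {"tea_quality": +1, "noise": -1, "tidy_desk": +1}  # cares about extra things
rival   = {"tea_quality": +1, "noise": +1}

print(aligned(servant, human))  # True: servant values everything the human values
print(aligned(human, servant))  # False: the relation is not symmetric
print(aligned(rival, human))    # False: disagreement about noise
```

Note that `servant` also illustrates the worry above: it counts as aligned with `human` no matter how little weight it puts on tea quality and noise relative to a tidy desk.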
What is alignment? (operationalisation second try)
Agent A is aligned with agent B if and only if giving more power (influence, compute, improved intelligence, etc.) to A makes things better according to B's values, and this relation holds for arbitrary increases in power.
This operationalisation points to exactly what we want, but is also not very helpful.
Re second try: what would make a high-level operationalisation of that sort helpful? (operationalize the helpfulness of an operationalisation)
This is a good question.
The not-so-operationalized answer is that a good operationalization is one that is helpful for achieving alignment.
An operationalization of [helpfulness of an operationalization] would give some sort of gears-level understanding of what shape an operationalization should have in order to be helpful. I don't have any good model of this, so I will just gesture vaguely.
I think that mathematical descriptions are good, since they are more precise. My first operationalization attempt is pretty mathematical, which is good. It is also more "constructive" (not sure if this is exactly the right word), i.e. it describes alignment in terms of internal properties rather than outcomes. Internal properties are more useful as design guidelines, as long as they are correct. The big problem with my first operationalization is that it doesn't actually point to what we want.
The problem with the second attempt is that it just states what outcome we want. There is nothing in there to help us achieve it.
Can't you restate the second one as the relationship between two utility functions U_A and U_B such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e. their respective derivatives are always non-negative for every background condition.
∂U_A/∂U_B ≥ 0 ∧ ∂U_B/∂U_A ≥ 0
Yes, I like this one. We don't want the AI to find a way to give itself utility while making things worse for us. And if we are trying to make things better for us, we don't want the AI to resist us.
Do you want to find out what these inequalities imply about the utility functions? Can you find examples where your condition is true for non-identical functions?
I don’t have a specific example right now but some things that come to mind:
Both utility functions ultimately depend in some way on a subset of background conditions, i.e. the world state
The world state influences the utility functions through latent variables in the agents’ world models, to which they are inputs.
U_A changes only when M_A (A's world model) changes, which is ultimately caused by new observations, i.e. changes in the world state (let's assume that both A and B perceive the world quite accurately).
If whenever U_A changes U_B doesn't decrease, then whatever change in the world increased U_A, B at least doesn't care about it. This is problematic when A and B need the same scarce resources (instrumental convergence etc.). The condition could be satisfied if they were both satisficers or bounded agents inhabiting significantly disjoint niches.
A robust solution seems to be to make (super accurately modeled) U_B a major input to U_A.
Let's say that
U_A = 3x + y
Then (I think) for your inequality to hold, it must be that
U_B = f(3x + y), where f' >= 0
If U_B cares about x and y in any other proportion, then B can make trade-offs between x and y that make things better for B but worse for A.
This will be true (in theory) even if both A and B are satisficers. You can see this by replacing y and x with sigmoids of some other variables.
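A small numeric check of the trade-off claim, with a made-up counterexample where B weighs x and y in a different proportion than A (U_B = 2x + y rather than a function of 3x + y):

```python
def U_A(x, y):
    return 3 * x + y

def U_B(x, y):          # not of the form f(3x + y)
    return 2 * x + y

x0, y0 = 1.0, 1.0
x1, y1 = 0.0, 2.5       # B trades away 1 unit of x for 2.5 units of y

print("Change in U_B:", U_B(x1, y1) - U_B(x0, y0))  # +0.5: better for B
print("Change in U_A:", U_A(x1, y1) - U_A(x0, y0))  # -0.5: worse for A
```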
Any policy can be modelled as a consequentialist agent, if you assume a contrived enough utility function. This statement is true, but not helpful.
The reason we care about the concept of agency is that there are certain things we expect from consequentialist agents, e.g. instrumentally convergent goals, or just optimisation pressure in some consistent direction. We care about the concept of agency because it holds some predictive power.
[… some steps of reasoning I don’t know yet how to explain …]
Therefore, it's better to use a concept of agency that depends on the internal properties of an algorithm/mind/policy-generator.
I don’t think agency can be made into a crisp concept. It’s either a fuzzy category or a leaky abstraction depending on how you apply the concept. But it does point to something important. I think it is worth tracking how agentic different systems are, because doing so has predictive power.
I recently updated how I view the alignment problem. The post that caused my update is this one from the shard sequence. Also worth mentioning is an older post that points to the same thing, which I just happened to read later.
Basically, I used to think we needed to solve both outer and inner alignment separately. Now I no longer think this is a good decomposition of the problem.
"It's not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow without matching it to a friendly base-objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human."

Quote from here