I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI’s agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI’s history, elucidating that general theory has not been the primary optimization objective.
Let’s go through some examples.
The Sequences: we can definitely see Eliezer’s understanding of the general theory of agency in many places, especially when talking about Bayes and utility. (Engines of Cognition is a central example.) But most of the Sequences talk about things like failure modes of human cognition, how to actually change your mind, social failure modes of human cognition, etc. It sure looks like the primary optimization objective is about better human thinking, plus some general philosophical foundations, not the elucidation of the general theory of agency.
Tiling agents and proof-based decision theories: I’m on board with the use of proof-based setups to make minimal assumptions about “the substrate that the agency is made of”. That’s an entirely reasonable choice, and it does look like that choice was driven (in large part) by a desire for the theory to apply quite generally. But these models don’t look like they were ever intended as general models of agency (I doubt they would apply nicely to e-coli); in your words, they provided “another system that is easy to say things about that could be used to triangulate agency in general”. That’s not necessarily a bad step on the road to general theory, but the general theory itself was not the main thing those models were doing. (Personally, I think we already have enough points to triangulate from for the time being. I think if someone were just directly, explicitly optimizing for a general theory of agency they’d probably come to that same conclusion. On the other hand, I could imagine someone very focused on self-reference barriers in particular might end up hunting for more data points, and it’s plausible that someone directly optimizing for a general theory of agency would end up focused mostly on self-reference.)
Grain of truth: similar to tiling agents and proof-based decision theories, this sounds like “another system that is easy to say things about that could be used to triangulate agency in general”. It does not sound like a part of the general theory of agency in its own right.
Logical induction: here we see something which probably would apply to an e-coli; it does sound like a part of a general theory of agency. (For the peanut gallery: I’m talking about the LI criterion here, not the particular algorithm.) On the other hand, I wouldn’t expect it to say much of interest about an e-coli beyond what we already know from older coherence theorems. It’s still mainly of interest in problems of reflection. And I totally buy that reflection is an important bottleneck to the general theory of agency, but I wouldn’t expect to see such a singular focus on that one bottleneck if someone were directly optimizing for a general theory of agency as their primary objective.
Embedded agents: in your own words, you “started by taking the stuff that MIRI has already been working on, mostly the artifacts of the Benya Era, and trying to communicate the central justification that would cause one to be interested in these topics”. You did not start by taking all the different agenty systems you could think of, and trying to communicate the central concept that would cause one to be interested in those systems. I do think embedded agency came closer than any other example on this list to tackling the general theory of agency, but it still wasn’t directly optimizing for that as the primary objective.
Going down that list (and looking at your more recent work), it definitely looks like research has been more and more directly pointed at the general theory of agency over time. But it also looks like that was not the primary optimization objective over most of MIRI’s history, which is why I don’t think slow progress on agent foundations to date provides strong evidence that the field is very difficult. Conversely, I’ve seen firsthand how tractable things are when I do optimize directly for a general theory of agency, and based on that experience I expect fairly fast progress.
(Addendum for the peanut gallery: I don’t mean to bash any of this work; every single thing on the list was at least great work, and a lot of it was downright brilliant. There’s a reason I said MIRI is the best org at this kind of work. My argument is just that it doesn’t provide especially strong evidence that agent foundations are hard, because the work wasn’t directly optimizing for the general theory of agency as its primary objective.)
Hmm, yeah, we might disagree about how much reflection (self-reference) is a central part of agency in general.
It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of “reflective agents.”
Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.
Does this seem right to you?
To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI’s lack of progress at least as much as you would if MIRI were doing the same thing as you, but for half as long.
Which isn’t *that* large an update. The average number of agent foundations researchers (that are public-facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.
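To put a rough number on that operationalization, here is a back-of-the-envelope sketch. It takes the figures from the two paragraphs above at face value (about 4 researchers, one decade, discounted to "half as long"); the variable and function names are purely illustrative, and since the claim is "at least as much", the result is best read as a lower bound on the equivalent effort.

```python
# Figures quoted in the comment above; names are illustrative assumptions.
AVG_RESEARCHERS = 4    # "like 4" public-facing agent foundations researchers
YEARS = 10             # "over the last decade"
TIME_DISCOUNT = 0.5    # "the same thing as you, but for half as long"


def equivalent_researcher_years(researchers: float, years: float, discount: float) -> float:
    """Researcher-years of effort that count as aimed directly at the target."""
    return researchers * years * discount


effective_years = equivalent_researcher_years(AVG_RESEARCHERS, YEARS, TIME_DISCOUNT)
print(effective_years)  # 20.0
```

So the evidence is roughly that of 20 researcher-years aimed directly at the target without success, which is a real update but not an enormous one.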
Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality-adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is non-trivially closer to John today than Agent Foundations 5 years ago is to John today.)
It seems accurate to me to say that Agent Foundations in 2014 was more focused on reflection, then shifted towards embeddedness, and then shifted towards abstraction. These things all flow together in my head, so Scott thinking about abstraction will have more reflection mixed in than John thinking about abstraction. (Indeed, I think progress on abstraction would have huge consequences for how we think about reflection.)
In case it is not obvious to people reading, I endorse John’s research program. (Which can maybe be inferred from the fact that I am arguing that it is similar to my own.) I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don’t expect disagreements about what to do next to have a strong impact on how to do the first step.
This all sounds right.
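In particular, for folks reading, I symmetrically agree with this part:
“In case it is not obvious to people reading, I endorse John’s research program. (Which can maybe be inferred from the fact that I am arguing that it is similar to my own.) I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don’t expect disagreements about what to do next to have a strong impact on how to do the first step.”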
… i.e. I endorse Scott’s research program, and mine is indeed similar. I wouldn’t be the least bit surprised if we disagree about what comes next, but we’re pretty aligned on what to do now.
Also, I realize now that I didn’t emphasize it in the OP, but a large chunk of my “50/50 chance of success” comes from other people’s work playing a central role, and the agent foundations team at MIRI is obviously at the top of the list of people whose work is likely to fit that bill. (There’s also the whole topic of producing more such people, which I didn’t talk about in the OP at all, but I’m tentatively optimistic on that front too.)
That does seem right.
I do expect reflection to be a pretty central part of the path to FOOM, but I expect it to be way easier to analyze once the non-reflective foundations of agency are sorted out. There are good reasons to expect otherwise on an outside view—i.e. all the various impossibility results in logic and computing. On the other hand, my inside view says it will make more sense once we understand e.g. how abstraction produces maps smaller than the territory while still allowing robust reasoning, how counterfactuals naturally pop out of such abstractions, how that all leads to something conceptually like a Cartesian boundary, the relationship between abstract “agent” and the physical parts which comprise the agent, etc.
If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I’d follow a path a lot more like MIRI’s. So yeah, this fits.
One thing I’m still not clear about in this thread is whether you (John) would feel that progress has been made on the theory of agency if all the problems on which MIRI has worked were instantaneously solved. Because there’s a difference between saying “this is the obvious first step if you believe reflection is the taut constraint” and “solving this problem would help significantly even if reflection weren’t the taut constraint”.
I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I’d expect that a lot of general theory of agency came along with it. But if a “solution” to something like e.g. the Tiling Problem didn’t come with a bunch of progress on more foundational general theory of agency, then I’d be very suspicious of that supposed solution, and I’d expect lots of problems to crop up when we try to apply the solution in practice.
(And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)
I agree with this asymmetry.
One thing I am confused about is whether to think of the e-coli as qualitatively different from the human. The e-coli is taking actions that can be well modeled by an optimization process searching for actions that would be good if this optimization process output them, which has some reflection in it.
It feels like it can behaviorally be well modeled this way, but is mechanistically not shaped like this. I feel like the mechanistic fact is more important, but I also feel like we are much closer to having behavioral definitions of agency than mechanistic ones.
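A toy sketch may make the behavioral/mechanistic distinction concrete. Everything here is a made-up illustration rather than a model of real E. coli: a one-dimensional world, a hypothetical `nutrient` function peaked at zero, and two hypothetical agents. One explicitly searches over actions, scoring each by what would happen if it output that action. The other is a bare reflex that reverses direction whenever its reading gets worse.

```python
ACTIONS = (-1.0, 1.0)  # toy action set: step left or step right


def nutrient(x: float) -> float:
    """Hypothetical nutrient concentration, highest at x = 0."""
    return -abs(x)


def run_search_agent(x: float, steps: int) -> float:
    """Behavioral model: explicitly search over actions, scoring each one by
    what would happen if this very procedure output it."""
    for _ in range(steps):
        x += max(ACTIONS, key=lambda a: nutrient(x + a))
    return x


def run_reflex_agent(x: float, steps: int, heading: float = 1.0) -> float:
    """Mechanistic model: no search at all. Keep the current heading while
    the reading does not get worse; reverse when it drops."""
    last = nutrient(x)
    for _ in range(steps):
        reading = nutrient(x)
        if reading < last:
            heading = -heading
        last = reading
        x += heading
    return x


search_final = run_search_agent(5.0, 20)
reflex_final = run_reflex_agent(5.0, 20)
print(search_final, reflex_final)
```

Under these assumptions both agents end up hovering within one step of the nutrient peak, so a purely behavioral description ("it acts like something optimizing nutrient") fits both, even though only the first one contains anything shaped like a search.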
I would say the e-coli’s fitness function has some kind of reflection baked into it, as does a human’s fitness function. The qualitative difference between the two is that a human’s own world model also has an explicit self-model in it, which is separate from the reflection baked into a human’s fitness function.
After that, I’d say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game.
… so yeah, I’m on basically the same page as you here.