(This comment was written before reading EDT=CDT. I think some of my views might update based on that when I have more time to think about it)
In your post, you say that before erasing information, a problem where what you do is determined is trivial, in that you only have the one option. That’s the position I’m disagreeing with.
It will be convenient for me to make a slightly different claim than the one I made above. Instead of claiming that the problem is trivial in completely determined situations, I will claim that it is trivial given the most straightforward interpretation of a problem* (the set of possible actions for an agent is the set of all actions consistent with the problem statement, and the chosen action is selected from this set). In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.
My response is to argue as per my previous comment that there doesn’t seem to be any criteria for determining which inconsistent actions are considered and which ones aren’t. I suppose you could respond that I haven’t provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it’s much more unclear how to approach this problem in the allowing inconsistency approach (although there has been some progress with things like playing chicken with the universe).
*excluding unprovability issues
The way you’re describing it, it sounds like erasing information isn’t something agents themselves are supposed to ever have to do
You’re at the stage of trying to figure out how agents should make decisions. I’m at the stage of trying to understand what making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions, or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I’d be perfectly happy just to have clear criteria by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.
I’m somewhat confused about what you’re saying in this paragraph and what assumptions you might be making
My point was that there isn’t any criteria for determining which inconsistent actions are considered and which ones aren’t if you are just thrown a complete description of a universe and an agent. Transparent Newcomb’s already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.
My impression is that some philosophers hold a decision theory like CDT and EDT responsible for what advice it offers in a particular situation, even if it would be impossible to put agents in that situation
I think it is important to use very precise language here. The agent isn’t being rated on what it would do in such a situation, it is being rated on whether or not it can be put into that situation at all.
I suspect that sometimes when an agent can’t be put into a situation it is because the problem has been badly formulated (or falls outside the scope of problems where its decision theory is defined), while in other cases this is a reason for or against utilising a specific decision theory algorithm. Holding an agent responsible for all situations it can’t be in seems like the wrong move; it feels like there is some more fundamental confusion that needs to be cleared up.
I take the motto “decisions are for making bad outcomes inconsistent”
I’m not a fan of reasoning via motto when discussing these kinds of philosophical problems which turn on very precise reasoning.
So it is not meaningless to talk about what happens if you take an action which is inconsistent with what you know!… I don’t know that you disagree with any of this… We can set up a sort of reverse transparent Newcomb, where you should take the action which makes the situation impossible
There’s something of a tension between what I’ve said in this post about only being able to take decisions that are consistent and what I said in Counterfactuals for Perfect Predictors, where I noted a way of doing something analogous to acting to make your situation inconsistent. This can be cleared up by noting that erasing information in many decision theory problems provides a problem statement where input-output maps can define all the relevant information about an agent. So I’m proposing that this technique be used in combination with erasure, rather than separately.
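To make the input-output-map idea concrete, here is a toy sketch (my own illustration, with invented names, not from either post): after erasure, everything decision-relevant about the agent is packed into a map from inputs to outputs, and a problem like Transparent Newcomb can be evaluated by scoring each such map against the predictor.

```python
# Toy sketch: the agent is characterized only by its input->output map
# (a policy). In Transparent Newcomb, the input is what the agent sees
# in the box, and the predictor fills the box based on the whole policy.

from itertools import product

INPUTS = ["box_full", "box_empty"]
ACTIONS = ["one_box", "two_box"]

def run_world(policy):
    """The predictor inspects the whole policy: it fills the box iff
    the agent would one-box upon seeing a full box."""
    box_full = policy["box_full"] == "one_box"
    observation = "box_full" if box_full else "box_empty"
    action = policy[observation]
    # One-boxing takes only the big box; two-boxing also takes the $1000.
    return (1_000_000 if box_full else 0) + (1_000 if action == "two_box" else 0)

# Enumerate every input->output map and score it.
best = max(
    (dict(zip(INPUTS, outputs)) for outputs in product(ACTIONS, repeat=len(INPUTS))),
    key=run_world,
)
print(best, run_world(best))
```

The point is that the problem statement never needs to refer to the agent's internals: the consistent situation produced by erasure is fully described by how each possible policy scores.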
In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.
I can agree that there’s an interpretational issue, but something is bugging me here which I’m not sure how to articulate. A claim which I would make and which might be somehow related to what’s bugging me is: the interpretation issue of a decision problem should be mostly gone when we formally specify it. (There’s still a big interpretation issue relating to how the formalization “relates to real cases” or “relates to AI design in practice” etc—ie, how it is used—but this seems less related to our disagreement/miscommunication.)
If the interpretation question is gone once a problem is framed in a formal way, then (speaking loosely here and trying to connect with what’s bugging me about your framing) it seems like either the formalism somehow forces us to do the forgetting (which strikes me as odd) or we are left with problems which really do involve impossible actions w/o any interpretation issue. I favor the latter.
My response is to argue as per my previous comment that there doesn’t seem to be any criteria for determining which inconsistent actions are considered and which ones aren’t.
The decision algorithm considers each output from a given set. For example, with proof-based decision theories such as MUDT, it is potentially convenient to consider the case where output is true or false (so that the decision procedure can be thought of as a sentence). In that case, the decision procedure considers those two possibilities. There is no “extract the set of possible actions from the decision problem statement” step—so you don’t run into a problem of “why not output 2? It’s inconsistent with the problem statement, but you’re not letting that stop you in other cases”.
It’s a property of the formalism, but it doesn’t seem like a particularly concerning one—if one imagines trying to carry things over to, say, programming a robot, there’s a clear set of possible actions even if you know the code may come to reliably predict its own actions. The problem of known actions seems to be about identifying the consequences of actions which you know you wouldn’t take, rather than about identifying the action set.
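A minimal sketch of this point (my own illustration; the names are invented): the candidate outputs are baked into the decision procedure itself rather than extracted from the problem statement, so the question of "output 2" never arises.

```python
# The action set is a property of the decision procedure, not of the
# problem statement. Nothing below inspects the problem to construct it,
# so "why not output 2?" never comes up: 2 is simply not in the set.

ACTIONS = [5, 10]  # fixed by the formalism, like MUDT's true/false outputs

def utility(action):
    # In 5&10, taking the $a bill is worth $a.
    return action

def decide():
    # The procedure ranges over exactly ACTIONS, even if the world
    # happens to make some of these outputs "inconsistent".
    return max(ACTIONS, key=utility)

print(decide())  # -> 10
```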
I suppose you could respond that I haven’t provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it’s much more unclear how to approach this problem in the allowing inconsistency approach (although there has been some progress with things like playing chicken with the universe).
I feel like I’m over-stating my position a bit in the following, but: this doesn’t seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free. IE, the notion of forgetting has so many free parameters that it doesn’t seem like much of a reduction of the problem. You say that a forgetting criterion would solve the problem of logical counterfactuals, but actually it is very unclear how much or how little it would accomplish.
You’re at the stage of trying to figure out how agents should make decisions. I’m at the stage of trying to understand what making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions, or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I’d be perfectly happy just to have clear criteria by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.
I disagree with the ‘stage’ framing (I wouldn’t claim to understand what making a good decision even means; I’d say that’s a huge part of the confusion I’m trying to stare at—for similar reasons, I disagree with your foundations foundations post in so far as it describes what I’m interested in as not being agent foundations foundations), but otherwise this makes sense.
This does seem like a big difference in perspective, and I agree that if I take that perspective, it is better to simply reject problems where the action taken by the agent is already determined (or call them trivial, etc). To me, that the agent itself needs to judge is quite central to the confusion about decisions.
My point was that there isn’t any criteria for determining which inconsistent actions are considered and which ones aren’t if you are just thrown a complete description of a universe and an agent.
As mentioned earlier, this doesn’t seem problematic to me. First, if you’re handed a description of a universe with an agent already in it, then you don’t have to worry about defining what the agent considers: the agent already considers what it considers (just like it already does what it does). You can look at a trace of the executed decision procedure and read off which actions it considers. (Granted, you may not know how to interpret the code, but I think that’s not the problem either of us is talking about.)
But there’s another difference here in how we’re thinking about decision theory, connected with the earlier-clarified difference. Your version of the 5&10 problem is that a decision theorist is handed a complete specification of the universe, including the agent. The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn’t know how to judge the agent’s decision.
(This might not be how you would define the 5&10 problem, but my goal here is to get at how you are thinking about the notion of decision problem in general, not 5&10 in particular—so bear with me.)
My version of the 5&10 problem is that you give a decision theorist the partially defined universe with the $5 bill on the table and the $10 bill on the table, stipulating that whatever source code the decision theorist chooses for the agent, the agent itself should know the source code and be capable of reasoning about it appropriately. (This is somewhat vague but can be given formalizations such as that of the setting of proof-based DT.) In other words, the decision theorist works with a decision problem which is a “world with a hole in it” (a hole waiting for an agent). The challenge lies in the fact that whatever agent is placed into the problem by the decision theorist, the agent is facing a fully-specified universe with no question marks remaining.
So, for the decision theorist, the challenge presented by the 5&10 problem is to define an agent which selects the 10. (Of course, it had better select the 10 via generalizable reasoning, not via special-case code which fails to do the right thing on other decision problems.) For a given agent inserted into the problem, there might be an issue or no issue at all.
We can write otherwise plausible-looking agents which take the $5, and for which it seems like the problem is spurious proofs; hence part of the challenge for the decision theorist seems to be the avoidance of spurious proofs. But, not all agents face this problem when inserted into the world of 5&10. For example, agents which follow the chicken rule don’t have this problem. This means that from the agent’s perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions.
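A toy mock-up of the chicken rule may help here (entirely my own sketch; the "prover" below is a stand-in predicate, not real proof search over arithmetic):

```python
# Chicken rule sketch: before comparing utilities, the agent checks
# whether the prover claims "agent() != a" for any action a, and if so
# takes a immediately, falsifying the claim. Any prover that proves such
# claims is thereby unsound, which blocks spurious proofs of the form
# "agent() = 5, so [agent() = 10 -> utility = 0]".

def make_agent(prover, actions=(5, 10)):
    def utility(a):
        # In 5&10, taking the $a bill is simply worth $a.
        return a

    def agent():
        for a in actions:
            if prover(f"agent() != {a}"):
                return a  # punish the prover by doing the "impossible"
        return max(actions, key=utility)

    return agent

# A sound prover proves nothing false about the agent, so the agent
# falls through to the utility comparison and takes the 10.
sound_agent = make_agent(lambda claim: False)
print(sound_agent())  # -> 10

# A prover that spuriously "proves" the agent won't take the 5 gets
# punished: the chicken rule makes the agent take the 5.
spurious_agent = make_agent(lambda claim: claim == "agent() != 5")
print(spurious_agent())  # -> 5
```

The feature mirrored here is that an agent following the chicken rule never has to reason about the consequences of a provably-untaken action: proving "agent() != a" immediately triggers taking a.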
Transparent Newcomb’s already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.
In the framing above, where we distinguish between the view of the decision theorist and the view of the agent, I would say that:
Often, as is (more or less) the case with transparent newcomb, a decision problem as-presented-to-the-decision-theorist does come with options and counterfactuals attached. Then, the interesting problem is usually to design an agent which (working from generalizable principles) recovers these correctly from within its embedded perspective.
Sometimes, we might write down a decision problem as source code, or in some other formalism. Then, it may not be obvious what the counterfactuals are / should be, even from the decision theorist’s perspective. We take something closer to the agent’s perspective, having to figure out for ourselves how to reason counterfactually about the problem.
Sometimes, a problem is given with a full description of its counterfactuals, but the counterfactuals as stated are clearly wrong: putting on our interpret-what-the-counterfactuals-are hats, we come up with an answer which differs from the one given in the problem statement. This means we need to be skeptical of the first case I mentioned, where we think we know what the counterfactuals are supposed to be and we’re just trying to get our agents to recover them correctly.
Point being, in all three cases I’m thinking about the problem of how to construct the counterfactuals from scratch—even the first case where I endorse the counterfactuals as given by the problem. This is only possible because of the distinction I’m making between a problem as given to a decision theorist and the problem as faced by an agent.
The interpretation issue of a decision problem should be mostly gone when we formally specify it
In order to formally specify a problem, you will already have explicitly or implicitly expressed an interpretation of what decision theory problems are. But this doesn’t make the question, “Is this interpretation valid?”, disappear. If we take my approach, we will need to provide a philosophical justification for the forgetting; if we take yours, we’ll need to provide a philosophical justification that we care about the results of these kinds of paraconsistent situations. Either way, there will be further work beyond the formalisation.
The decision algorithm considers each output from a given set… It’s a property of the formalism, but it doesn’t seem like a particularly concerning one
This ties into a point I’ll discuss later: I think that being able to ask an external observer to evaluate whether an actual, real agent took the optimal decision is the core problem in tying real-world decision theory problems to the more abstract theoretical ones. Further down you write:
The agent already considers what it considers (just like it already does what it does)
But I’m trying to find a way of evaluating an agent from the external perspective. Here, it is valid to criticise an agent for not selecting an action that it didn’t consider. Further, it isn’t always clear which actions are “considered”, as not all agents have a loop over all actions; they may use shortcuts to avoid explicitly evaluating a certain action.
I feel like I’m over-stating my position a bit in the following, but: this doesn’t seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free
“Forgetting” has a large number of free parameters, but so does “deontology” or “virtue ethics”. I’ve provided some examples and key details about how this would proceed, but I don’t think you can expect too much more at this very preliminary stage. When I said that a forgetting criterion would solve the problem of logical counterfactuals for free, this was a slight exaggeration. We would still have to justify why we care about raw counterfactuals, but, since these situations are actually consistent, this seems a much easier task than arguing that we should care about what happens in the kind of inconsistent situations generated by paraconsistent approaches.
I disagree with your foundations foundations post in so far as it describes what I’m interested in as not being agent foundations foundations
Your version of the 5&10 problem… The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn’t know how to judge the agent’s decision.
That’s exactly how I’d put it. Except I would say I’m interested in the problem from the external perspective and the reflective perspective. I just see the external perspective as easier to understand first.
From the agent’s perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions
Sure. But the agent is thinking about inconsistent actions beneath the surface which is why we have to worry about spurious counterfactuals. And this is important for having a way of determining if it is doing what it should be doing. (This becomes more important in the edge cases like Troll Bridge—https://agentfoundations.org/item?id=1711)
My interest is in how to construct them from scratch
Consider the following types of situations:
1) A complete description of a world, with an agent identified
2) A theoretical decision theory problem viewed by an external observer
3) A theoretical decision theory problem viewed reflectively
I’m trying to get from 1->2, while you are trying to get from 2->3. Whatever formalisations we use need to ultimately relate to the real world in some way, which is why I believe that we need to understand the connection from 1->2. We could also try connecting 1->3 directly, although that seems much more challenging. If we ignore the link from 1->2 and focus solely on a link from 2->3, then we will end up implicitly assuming a link from 1->2 which could involve assumptions that we don’t actually want.
Sounds like the disagreement has mostly landed in the area of questions of what to investigate first, which is pretty firmly “you do you” territory—whatever most improves your own picture of what’s going on, that is very likely what you should be thinking about.
On the other hand, I’m still left feeling like your approach is not going to be embedded enough. You say that investigating 2->3 first risks implicitly assuming too much about 1->2. My sketchy response is that what we want in the end is not a picture which is necessarily even consistent with having any 1->2 view. Everything is embedded, and implicitly reflective, even the decision theorist thinking about what decision theory an agent should have. So, a firm 1->2 view can hurt rather than help, due to overly non-embedded assumptions which have to be discarded later.
Using some of the ideas from the embedded agency sequence: a decision theorist may, in the course of evaluating a decision theory, consider a lot of #1-type situations. However, since the decision theorist is embedded as well, the decision theorist does not want to assume realizability even with respect to their own ontology. So, ultimately, the decision theorist wants a decision theory to have “good behavior” on problems where no #1-type view is available (meaning some sort of optimality for non-realizable cases).
(This comment was written before reading EDT=CDT. I think some of my views might update based on that when I have more time to think about it)
It will be convenient for me to make a slightly different claim than the one I made above. Instead of claiming that the problem is trivial in completely determined situations, I will claim that it is trivial given the most straightforward interpretation of a problem* (the set of possible actions for an agent are all those which are consistent with the problem statement and the action which is chosen is selected from this set of possible actions). In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.
My response is to argue as per my previous comment that there doesn’t seem to be any criteria for determining which inconsistent actions are considered and which ones aren’t. I suppose you could respond that I haven’t provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it’s much more unclear how to approach this problem in the allowing inconsistency approach (although there has been some progress with things like playing chicken with the universe).
*excluding unprovability issues
You’re at the stage of trying to figure out how agents should make decisions. I’m at the stage of trying to understand what a making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I’d be perfectly happy just to have a clear criteria by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.
My point was that there isn’t any criteria for determining which inconsistent actions are considered and which ones aren’t if you are just thrown a complete description of a universe and an agent. Transparent Newcomb’s already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.
I think it is important to use very precise language here. The agent isn’t being rated on what it would do in such a situation, it is being rated on whether or not it can be put into that situation at all.
I suspect that sometimes when an agent can’t be put into a situation it is because the problem has been badly formulated (or falls outside the scope of problems where its decision theory is defined), while in other cases this is a reason for or against utilising a specific decision theory algorithm. Holding an agent responsible for all situations it can’t be in seems like the wrong move, as it feels that there is some more fundamental confusion that needs to be cleaned up.
I’m not a fan of reasoning via motto when discussing these kinds of philosophical problems which turn on very precise reasoning.
There’s something of a tension between what I’ve said in this post about only being able to take decisions that are consistent and what I said in Counterfactuals for Perfect Predictors, where I noted a way of doing something analogous to acting to make your situation inconsistent. This can be cleared up by noting that erasing information in many decision theory problems provides a problem statement where input-output maps can define all the relevant information about an agent. So I’m proposing that this technique be used in combination with erasure, rather than separately.
I can agree that there’s an interpretational issue, but something is bugging me here which I’m not sure how to articulate. A claim which I would make and which might be somehow related to what’s bugging me is: the interpretation issue of a decision problem should be mostly gone when we formally specify it. (There’s still a big interpretation issue relating to how the formalization “relates to real cases” or “relates to AI design in practice” etc—ie, how it is used—but this seems less related to our disagreement/miscommunication.)
If the interpretation question is gone once a problem is framed in a formal way, then (speaking loosely here and trying to connect with what’s bugging me about your framing) it seems like either the formalism somehow forces us to do the forgetting (which strikes me as odd) or we are left with problems which really do involve impossible actions w/o any interpretation issue. I favor the latter.
The decision algorithm considers each output from a given set. For example, with proof-based decision theories such as MUDT, it is potentially convenient to consider the case where output is true or false (so that the decision procedure can be thought of as a sentence). In that case, the decision procedure considers those two possibilities. There is no “extract the set of possible actions from the decision problem statement” step—so you don’t run into a problem of “why not output 2? It’s inconsistent with the problem statement, but you’re not letting that stop you in other cases”.
It’s a property of the formalism, but it doesn’t seem like a particularly concerning one—if one imagines trying to carry things over to, say, programming a robot, there’s a clear set of possible actions even if you know the code may come to reliably predict its own actions. The problem of known actions seems to be about identifying the consequences of actions which you know you wouldn’t take, rather than about identifying the action set.
I feel like I’m over-stating my position a bit in the following, but: this doesn’t seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free. IE, the notion of forgetting has so many free parameters that it doesn’t seem like much of a reduction of the problem. You say that a forgetting criterion would solve the problem of logical counterfactuals, but actually it is very unclear how much or how little it would accomplish.
I disagree with the ‘stage’ framing (I wouldn’t claim to understand what making a good decision even means; I’d say that’s a huge part of the confusion I’m trying to stare at—for similar reasons, I disagree with your foundations foundations post in so far as it describes what I’m interested in as not being agent foundations foundations), but otherwise this makes sense.
This does seem like a big difference in perspective, and I agree that if I take that perspective, it is better to simply reject problems where the action taken by the agent is already determined (or call them trivial, etc). To me, that the agent itself needs to judge is quite central to the confusion about decisions.
As mentioned earlier, this doesn’t seem problematic to me. First, if you’re handed a description of a universe with an agent already in it, then you don’t have to worry about defining what the agent considers: the agent already considers what it considers (just like it already does what it does). You can look at a trace of the executed decision procedure and read off which actions it considers. (Granted, you may not know how to interpret the code, but I think that’s not the problem either of us are talking about.)
But there’s another difference here in how we’re thinking about decision theory, connected with the earlier-clarified difference. Your version of the 5&10 problem is that a decision theorist is handed a complete specification of the universe, including the agent. The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn’t know how to judge the agent’s decision.
(This might not be how you would define the 5&10 problem, but my goal here is to get at how you are thinking about the notion of decision problem in general, not 5&10 in particular—so bear with me.)
My version of the 5&10 problem is that you give a decision theorist the partially defined universe with the $5 bill on the table and te $10 bill on the table, stipulating that whatever source code the decision theorist chooses for the agent, the agent itself should know the source code and be capable of reasoning about it appropriately. (This is somewhat vague but can be given formalizations such as that of the setting of proof-based DT.) In other words, the decision theorist works with a decision problem which is a “world with a hole in it” (a hole waiting for an agent). The challenge lies in the fact that whatever agent is placed into the problem by the decision theorist, the agent is facing a fully-specified universe with no question marks remaining.
So, for the decision theorist, the challenge presented by the 5&10 problem is to define an agent which selects the 10. (Of course, it had better select the 10 via generalizable reasoning, not via special-case code which fails to do the right thing on other decision problems.) For a given agent inserted into the problem, there might be an issue or no issue at all.
We can write otherwise plausible-looking agents which take the $5, and for which it seems like the problem is spurious proofs; hence part of the challenge for the decision theorist seems to be the avoidance of spurious proofs. But, not all agents face this problem when inserted into the world of 5&10. For example, agents which follow the chicken rule don’t have this problem. This means that from the agent’s perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions.
In the framing above, where we distinguish between the view of the decision theorist and the view of the agent, I would say that:
Often, as is (more or less) the case with transparent newcomb, a decision problem as-presented-to-the-decision-theorist does come with options and counterfactuals attached. Then, the interesting problem is usually to design an agent which (working from generalizable principles) recovers these correctly from within its embedded perspective.
Sometimes, we might write down a decision problem as source code, or in some other formalism. Then, it may not be obvious what the counterfactuals are / should be, even from the decision theorist’s perspective. We take something closer to the agent’s perspective, having to figure out for ourselves how to reason counterfactually about the problem.
Sometimes, a problem is given with a full description of its counterfactuals, but the counterfactuals as stated are clearly wrong: putting on our interpret-what-the-counterfactuals-are hats, we come up with an answer which differs from the one given in the problem statement. This means we need to be skeptical of the first case I mentioned, where we think we know what the counterfactuals are supposed to be and we’re just trying to get our agents to recover them correctly.
Point being, in all three cases I’m thinking about the problem of how to construct the counterfactuals from scratch—even the first case where I endorse the counterfactuals as given by the problem. This is only possible because of the distinction I’m making between a problem as given to a decision theorist and the problem as faced by an agent.
In order to formally specify a problem, you will have already explicitly or implicitly expressed what an interpretation of what decision theory problems are. But this doesn’t make the question, “Is this interpretation valid?” disappear. If we take my approach, we will need to provide a philosophical justification for the forgetting; if we take yours, we’ll need to provide a philosophical justification that we care about the results of these kinds of paraconsistent situations. Either way, there will be further work beyond the formularisation.
This ties into the point I’ll discuss later about how I think being able to ask an external observer to evaluate whether an actual real agent took the optimal decision is the core problem in tying real world decision theory problems to the more abstract theoretical decision theory problems. Further down you write:
But I’m trying to find a way of evaluating an agent from the external perspective. Here, it is valid to criticise an agent for not selecting an action that it didn’t consider. Further, it isn’t always clear which actions are “considered”, as not every agent will have a loop over all actions; an agent may use shortcuts that avoid explicitly evaluating a certain action.
“Forgetting” has a large number of free parameters, but so does “deontology” or “virtue ethics”. I’ve provided some examples and key details about how this would proceed, but I don’t think you can expect too much more at this very preliminary stage. When I said that a forgetting criterion would solve the problem of logical counterfactuals for free, this was a slight exaggeration. We would still have to justify why we care about raw counterfactuals. But since raw counterfactuals are actually consistent, that would seem to be a much easier task than arguing that we should care about what happens in the kinds of inconsistent situations generated by paraconsistent approaches.
I actually included the Smoking Lesion Steelman (https://www.alignmentforum.org/s/fgHSwxFitysGKHH56/p/5bd75cc58225bf0670375452) as Foundations Foundations research. And CDT=EDT is pretty far along in this direction as well (https://www.alignmentforum.org/s/fgHSwxFitysGKHH56/p/x2wn2MWYSafDtm8Lf), although in my conception of what Foundations Foundations research should look like, more attention would have been paid to the possibility of the EDT graph being inconsistent, while the CDT graph was consistent.
That’s exactly how I’d put it, except I would say I’m interested in the problem from both the external perspective and the reflective perspective. I just see the external perspective as easier to understand first.
Sure. But the agent is thinking about inconsistent actions beneath the surface, which is why we have to worry about spurious counterfactuals. And this is important for having a way of determining whether it is doing what it should be doing. (This becomes more important in edge cases like Troll Bridge—https://agentfoundations.org/item?id=1711)
Consider the following types of situations:
1) A complete description of a world, with an agent identified
2) A theoretical decision theory problem viewed by an external observer
3) A theoretical decision theory problem viewed reflectively
I’m trying to get from 1->2, while you are trying to get from 2->3. Whatever formalisations we use need to ultimately relate to the real world in some way, which is why I believe we need to understand the connection from 1->2. We could also try connecting 1->3 directly, although that seems much more challenging. If we ignore the link from 1->2 and focus solely on the link from 2->3, then we will end up implicitly assuming a link from 1->2, which could involve assumptions that we don’t actually want.
Sounds like the disagreement has mostly landed in the area of questions of what to investigate first, which is pretty firmly “you do you” territory—whatever most improves your own picture of what’s going on, that is very likely what you should be thinking about.
On the other hand, I’m still left feeling like your approach is not going to be embedded enough. You say that investigating 2->3 first risks implicitly assuming too much about 1->2. My sketchy response is that what we want in the end is not a picture which is necessarily even consistent with having any 1->2 view. Everything is embedded, and implicitly reflective, even the decision theorist thinking about what decision theory an agent should have. So, a firm 1->2 view can hurt rather than help, due to overly non-embedded assumptions which have to be discarded later.
Using some of the ideas from the embedded agency sequence: a decision theorist may, in the course of evaluating a decision theory, consider a lot of #1-type situations. However, since the decision theorist is embedded as well, the decision theorist does not want to assume realizability even with respect to their own ontology. So, ultimately, the decision theorist wants a decision theory to have “good behavior” on problems where no #1-type view is available (meaning some sort of optimality for non-realizable cases).