Some tangentially related thoughts:
It seems that in many simple worlds (such as the Bomb world), an indexically-selfish agent with a utility function u over centered histories would prefer to commit to UDT with a utility function u′ over uncentered histories; where u′ is defined as the sum of all the “uncentered versions” of u (version i corresponds to u when the pointer is assumed to point to agent i).
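Spelled out (my notation; assuming there are $n$ candidate agents, and writing $u_i$ for the uncentered version of $u$ in which the pointer is assumed to point to agent $i$):

```latex
u'(h) \;=\; \sum_{i=1}^{n} u_i(h)
```

where $h$ ranges over uncentered histories, so $u'$ is well-defined without any indexical pointer.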
Things seem to get more confusing in messy worlds, in which an agent's inability to define a utility function (over uncentered histories) that distinguishes between agent 1 and agent 2 does not entail that the two agents are about to make the same decision.
I agree. It seems that in that situation the person would be “rational” to choose Right.
I’m still confused about the “UDT is incompatible with this kind of selfish values” part. It seems that an indexically-selfish person—after failing to make a binding commitment and seeing the bomb—could still rationally commit to UDT from that moment on, by defining the utility s.t. only copies that found themselves in that situation (i.e. those who failed to make a binding commitment and saw the bomb) matter. That utility is a function over uncentered histories of the world, and would result in UDT choosing Right.
Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don’t know if I’m a simulation or a real person. If I was selfish in an indexical way, I would think something like “If I’m a simulation then it doesn’t matter what I choose. The simulation will end as soon as I make a choice so my choice is inconsequential. But if I’m a real person, choosing Left will cause me to be burned. So I should choose Right.”
It seems to me that even in this example, a person (who is selfish in an indexical way) would prefer—before opening their eyes—to make a binding commitment to choose Left. If so, the “intuitively correct answer” that UDT is unable to give is actually just the result of a failure to make a beneficial binding commitment.
(I’m not a decision theorist)
FDT in any form will violate Guaranteed Payoffs, which should be one of the most basic constraints on a decision theory
Fulfilling the Guaranteed Payoffs principle as defined here seems to entail two-boxing in the Transparent Newcomb’s Problem, and generally not being able to follow through on precommitments when facing a situation with no uncertainty.
My understanding is that a main motivation for UDT (which FDT is very similar to?) is to get an agent that, when finding itself in a situation X, follows through on any precommitment that—before learning anything about the world—the agent would have wanted to follow through on when it is in situation X. Such a behavior would tend to violate the Guaranteed Payoffs principle, but would be beneficial for the agent?
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in.
This definition of optimizer_2 depends on the definition of “environment”. It seems that for an RL agent you use the word “environment” to mean the formal environment as defined in RL. How do you define “environment”, for this purpose, in non-RL settings?
What should be considered the environment of a SAT solver, or an arbitrary mesa-optimizer that was optimized to be a SAT solver?
Rohin’s opinion: I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
It seems to me that a SAT solver can be arbitrarily competent at solving SAT problems without being the second kind of optimizer (i.e. without acting upon its environment to change it), even while it solves SAT problems that encode the dynamics of our world. For example, this seems to be the case for a SAT solver that is just a brute force search with arbitrarily large amount of computing power.
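To illustrate what I mean by a brute-force SAT solver: it is just exhaustive search over assignments, evaluating a formula without taking any actions in an environment (a minimal sketch; the function name and CNF encoding are my own):

```python
from itertools import product

def brute_force_sat(formula, num_vars):
    """Exhaustively search all assignments. `formula` is a CNF given as a
    list of clauses; each clause is a list of signed 1-based literals
    (e.g. -2 means "NOT x2")."""
    for bits in product([False, True], repeat=num_vars):
        # An assignment satisfies the CNF iff every clause has a true literal.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in formula):
            return bits  # satisfying assignment found
    return None  # unsatisfiable

# (x1 OR x2) AND (NOT x1 OR x2) is satisfiable, e.g. by x2 = True.
print(brute_force_sat([[1, 2], [-1, 2]], 2))
```

However powerful the machine running this loop, the program only maps a formula to an assignment; nothing in it pushes on the world outside the computation.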
[EDIT: When writing this comment, I considered “the environment of a SAT solver” to be the world that contains the computer running the SAT solver. However, this seems to contradict what Joar had in mind in his post].
Ah, I agree (edited my comment above accordingly).
Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.
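A toy numeric sketch of that point (entirely my own construction): suppose a single parameter `w` controls a "sacrifice the current episode for future episodes" behavior, so the current-episode loss grows with `w`. Online gradient descent on the per-episode loss then drives `w` toward zero, regardless of any cross-episode benefit:

```python
# Toy per-episode loss: loss(w) = w**2, so larger w always hurts the
# current episode. Online gradient descent updates on this loss alone.
lr = 0.1
w = 1.0  # initial strength of the cross-episode "sacrifice" behavior
for episode in range(100):
    current_loss_grad = 2 * w  # d/dw of the current episode's loss w**2
    w -= lr * current_loss_grad
print(w)  # w has shrunk essentially to zero
```

An evolutionary algorithm that selects on cumulative performance across episodes would not share this property, which is why the reasoning above may still apply there.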
I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Maybe learning algorithms that have the following property are more likely to yield models with “cross-episodic behavior”:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
Also, what name would you suggest for this problem, if not “inner alignment”?
Maybe “non-myopia” as Evan suggested.
Inner alignment—The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).
If the trained model tries to minimize loss in future episodes, it definitely seems dangerous, but I’m not sure that we should consider this an inner-alignment failure. In some sense we got the behavior that our episodic learning algorithm was optimizing for.
For example, consider the following episodic learning algorithm: At the end of each episode, if the model fails to achieve the episode’s goal, its network parameters are completely randomized (and if it achieves the goal, the model is left unchanged). If we run this learning algorithm for an arbitrarily long time, we should expect to end up with a model that behaves in a way that results in achieving the goal in every future episode (if such a model exists).
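A toy simulation of that selection dynamic (everything here is a hypothetical stand-in: the "model" is a bit vector, and an episode succeeds only if a fixed target pattern is present). Failure rerandomizes the parameters, so after enough episodes only an always-succeeding model survives:

```python
import random

random.seed(0)
TARGET = (1, 0, 1)  # the (toy) parameter setting that achieves the goal

def random_model():
    """Draw fresh random 'network parameters'."""
    return tuple(random.randint(0, 1) for _ in range(3))

model = random_model()
for episode in range(10_000):
    achieved_goal = (model == TARGET)  # toy success criterion
    if not achieved_goal:
        model = random_model()  # failure: completely randomize parameters
    # on success: model is left unchanged, so it persists forever after

print(model)  # overwhelmingly likely to equal TARGET by now
```

The point is that nothing in the update rule refers to future episodes, yet the surviving model is precisely one whose behavior succeeds in every future episode.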
I think the following is potentially another remaining safety problem:
[EDIT: actually it’s an inner alignment problem, using the definition here]
Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many “luckier” copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by “magically” finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those “luckier” copies.
I’m confused about why the “spontaneous meta-learning” in Ortega et al. is equivalent to (or a special case of?) mesa-optimization; which was also suggested in MIRI’s August 2019 Newsletter. My understanding of Ortega et al. is that “spontaneous meta-learning” describes a scenario in which training on a sequence from a single generator is equivalent to training on sequences from multiple generators. I haven’t seen them discuss this issue in the context of the trained model itself doing search/optimization.
To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?
If an agent is trained with an episodic learning scheme and ends up with a behavior that maximizes reward across episodes, I’m not sure we should consider this an inner alignment failure. In some sense, we got the behavior that our learning scheme was optimizing for. [EDIT: this is not true for all learning algorithms, e.g. gradient descent, see discussion here]
To quickly see this, imagine an episodic learning scheme where, at the end of each episode, if the agent fails to achieve the episode’s goal, its policy network parameters are completely randomized, and otherwise the agent is left unchanged. Assuming we have infinite resources, if we run this learning scheme for an arbitrarily long time, we should expect to end up with an agent that tries to achieve goals in future episodes.
You’d have to memorize all the training data and labels too.
(just noting that same goes for a decision tree that isn’t small enough s.t. the human can memorize it)
Unless the human could memorize the initialization parameters, they would be using a different neural network to classify.
Why wouldn’t the “human-trained network” be identical to the “original network”? [EDIT: sorry, missed your point. If the human knows the logic of the random number generator that was used to initialize the parameters of the original network, they can manually run the same logic themselves.]
By the same logic, any decision tree that is too large for a human to memorize does not allow theory simulatability as defined in the OP.
I’m still not sure about the distinction. A human with an arbitrarily large amount of time & paper could “train” a new NN (instead of working to “extract a decision tree”), and then “use” that NN.
Also, as a terminological note, I’ve taken to using “optimizer” for optimizer_1 and “agent” for something closer to optimizer_2, where I’ve been defining an agent as an optimizer that is performing a search over what its own action should be.
I’m confused about this part. According to this definition, is “agent” a special case of optimizer_1? If so it doesn’t seem close to how we might want to define a “consequentialist” (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).
Maybe we’re just not using the same definitions, but according to the definitions in the OP as I understand them, a box might indeed contain an arbitrarily strong optimizer_1 while not containing an optimizer_2.
For example, suppose the box contains an arbitrarily large computer that runs a brute-force search for some formal optimization problem. [EDIT: for some optimization problems, the evaluation of a solution might result in the execution of an optimizer_2]
It seems useful to have a quick way of saying:
“The quarks in this box implement a Turing Machine that [performs well on the formal optimization problem P and does not do any other interesting stuff]. And the quarks do not do any other interesting stuff.”
(which of course does not imply that the box is safe)
Meta: I think there’s an attempt to deprecate the term “inner optimizer” in favor of “mesa-optimizer” (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).
Looking at the “regular” Twitter feed seems as dangerous for one’s productivity as looking at Facebook’s feed. Market incentives push Twitter to make users spend as much time as possible on the platform (using the best ML models they can train for that purpose).
A safer way to use Twitter is to create a very short list of Twitter accounts (the accounts with the highest EV per tweet), and then regularly go over the complete “feed” of just that list, sorted chronologically (giving Twitter no say in what you see).