axioman

Karma: 138

axioman 14 Jun 2019 20:05 UTC
1 point
on: Let’s talk about “Convergent Rationality”
“If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).”
Can you give an example of such an opportunity being given but not taken?

axioman 23 Aug 2019 17:51 UTC
3 points
in reply to: Hjalmar_Wijk’s comment on: Tabooing ‘Agent’ for Prosaic Alignment
In light of this exchange, it seems like it would be interesting to analyze how far arguments for problematic properties of superintelligent utility-maximizing agents (like instrumental convergence) actually carry over to the broader class of well-generalizing systems.

axioman 15 Dec 2019 19:07 UTC
1 point
on: Sections 1 & 2: Introduction, Strategy and Governance
“If players could commit to the terms of peaceful settlements and truthfully disclose private information necessary for the construction of a settlement (for instance, information pertaining to the outcome probability p in Example 1.1.1), the allocation of indivisible stakes could often be accomplished. Thus, the most plausible of Fearon’s rationalist explanations for war seem to be (1) the difficulty of credible commitment and (2) incomplete information (and incentives to misrepresent that information). ”
It seems plausible that if players could truthfully disclose private information and divide stakes, the ability to credibly commit would often not be needed. Would that in turn reduce the plausibility of explanation (1)?
I am curious whether there are some further arguments for the second sentence in the quote that were omitted to save space.

axioman 18 Dec 2019 16:35 UTC
1 point
in reply to: JesseClifton’s comment on: Sections 1 & 2: Introduction, Strategy and Governance
My reasoning relies more on the divisibility of stakes (without having to resort to randomization). If there is a deterministic settlement that is preferable to conflict, then nobody has an incentive to break the settlement.
However, my main point was that I read the paragraph I quoted as “we don’t need the divisibility of stakes if we have credibility and complete information, therefore credibility and complete information is more important than divisibility of stakes”. I do not really find this line of argument convincing, as I am not convinced that you could not make the same argument with the role of credibility and divisible stakes reversed. Did I maybe misread what you are saying there?
Your conclusion still seems plausible, though, and I suspect that you have other arguments for focusing on credibility. I would like to hear those.

axioman 20 Dec 2019 20:53 UTC
3 points
on: Sections 5 & 6: Contemporary Architectures, Humans in the Loop
It seems like replacing two agents A and B by a single agent that optimizes for their welfare function would avoid the issue of punishment. I guess that doing this might be feasible in some cases for artificial agents (as a single agent optimizing for the welfare function is a simpler object than the two-agent dynamics including punishment) and potentially understudied, as the solution seems harder to implement for humans (even though human solutions to collective action problems at least resemble the approach). One key problem might be finding a welfare function that both agents agree on, especially if there is information asymmetry.
Any thoughts on this?
Edit: The approach seems to be most trivial when both agents share their world model and optimize for explicit utilities over this world model. More generally, two principals with similar amounts of compute and similarly easily optimizable utility functions are most likely better off building an agent that optimizes for their welfare instead of two agents that need to learn to compete and cooperate. Optimizing for the welfare function applied to the agent’s value functions can be done by a somewhat straightforward modification of Q-learning or (in the case of differentiable welfare) policy gradient methods.
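One possible reading of the Q-learning modification, as a minimal tabular sketch: keep one Q-value per principal and pick the greedy action by applying a welfare function to that pair of values. The welfare function (a Nash-style product), the toy two-state environment, and all names below are assumptions made for illustration. No convergence guarantee is claimed for nonlinear welfare functions in general.

```python
import random

def welfare(values):
    # Hypothetical welfare function: product of the two principals' values.
    return values[0] * values[1]

def greedy_action(q, state, actions):
    # Choose the action whose per-principal Q-values have the highest welfare.
    return max(actions, key=lambda a: welfare(q[(state, a)]))

def welfare_q_learning(transitions, states, actions, start,
                       episodes=500, lr=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # One Q-value per principal for every (state, action) pair.
    q = {(s, a): [0.0, 0.0] for s in states for a in actions}
    for _ in range(episodes):
        state, done = start, False
        while not done:
            action = rng.choice(actions)  # uniform exploration, for simplicity
            next_state, rewards, done = transitions[(state, action)]
            if done:
                bootstrap = [0.0, 0.0]
            else:
                # Bootstrap from the welfare-greedy action in the next state.
                bootstrap = q[(next_state, greedy_action(q, next_state, actions))]
            for i in range(2):
                target = rewards[i] + gamma * bootstrap[i]
                q[(state, action)][i] += lr * (target - q[(state, action)][i])
            state = next_state
    return q

# Toy deterministic environment: (state, action) -> (next_state, (r_A, r_B), done).
STATES = ["s0", "s1"]
ACTIONS = [0, 1]
TRANSITIONS = {
    ("s0", 0): ("s0", (1.0, 0.0), True),   # good for A only, then stop
    ("s0", 1): ("s1", (0.0, 0.0), False),  # move on to the shared option
    ("s1", 0): ("s1", (0.7, 0.7), True),   # decent for both
    ("s1", 1): ("s1", (0.0, 1.0), True),   # good for B only
}

q = welfare_q_learning(TRANSITIONS, STATES, ACTIONS, start="s0")
print(greedy_action(q, "s0", ACTIONS), q[("s0", 1)])
```

In this toy case the single agent learns to pass up the option that only benefits A and heads for the outcome that is decent for both, without any punishment dynamics between separate agents.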

axioman 20 Dec 2019 22:00 UTC
1 point
on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
“If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.”
It seems like the exact shape of the diminishing returns might be quite hard to infer, while wrong “rates” of diminishing returns can lead to (slightly less severe versions of) the same problems as not modelling diminishing returns at all.
We probably at least need to incorporate our uncertainty about how returns diminish in some way. I am a bit confused about how to do this, as slowly diminishing functions will probably dominate if we just take an expectation over all candidates?
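A small numerical sketch of that worry, assuming a hypothetical family of candidate utilities $u_{\alpha}(x) = x^{\alpha}$ with a uniform belief over three exponents (neither the family nor the numbers come from the post). It reports what share of the expected utility comes from the most slowly diminishing candidate as the amount $x$ grows.

```python
def candidate_utility(x, alpha):
    # Hypothetical candidate: smaller alpha means faster diminishing returns.
    return x ** alpha

def expected_utility(x, alphas, weights):
    # Plain expectation over the candidate utility functions.
    return sum(w * candidate_utility(x, a) for a, w in zip(alphas, weights))

def share_of_slowest(x, alphas, weights):
    # Fraction of the expectation contributed by the largest exponent.
    top = max(range(len(alphas)), key=lambda i: alphas[i])
    return weights[top] * candidate_utility(x, alphas[top]) / expected_utility(x, alphas, weights)

ALPHAS = [0.2, 0.5, 0.9]
WEIGHTS = [1 / 3, 1 / 3, 1 / 3]

for x in (1.0, 1e2, 1e4, 1e6):
    print(x, round(share_of_slowest(x, ALPHAS, WEIGHTS), 3))
```

Under these assumptions the share starts at one third and approaches one for large $x$, so the expectation behaves almost like the slowest-diminishing candidate alone.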

axioman 30 Dec 2019 17:42 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
My model goes something like this: If increasing values requires using some resource, gaining access to more of the resource can be positive sum, while spending it is negative sum due to opportunity costs. In this model, the economy can be positive sum because it helps with alleviating resource constraints.
But maybe it does not really matter if most interactions are positive-sum until some kind of resource limit is reached and negative-sum only after?

axioman 31 Dec 2019 10:04 UTC
3 points
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.
Thank you for alleviating my confusion.

axioman 6 Jan 2020 8:07 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems to rather be that the best states for one of the (Edit: the expected) rewards are bad for the other?
That again seems like it would often follow from resource constraints.

axioman 9 Jan 2020 21:03 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
But no matter how I choose the default outcome, your second example is always “more positive sum” than the first, because 0.5 + 0.7 + 2x < 1.5 − 0.1 + 2x.
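Spelled out, with $x$ read as the term shared by both sides (an assumption about the notation in the examples):

```latex
\begin{align*}
0.5 + 0.7 + 2x &= 1.2 + 2x, \\
1.5 - 0.1 + 2x &= 1.4 + 2x, \\
(1.4 + 2x) - (1.2 + 2x) &= 0.2 > 0 \quad \text{for every } x.
\end{align*}
```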
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to “more negative sum”, but this still seems to point to the sum-condition not being the central concept here. To me, it seems like “negative min” compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
Or am I completely misunderstanding your examples or your point?

axioman 12 Jan 2020 11:26 UTC
10 points
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization), and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
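A minimal sketch of the convex versus concave distinction, under two assumptions that are not in the post: "convex" is read as a boundary bulging outward (here a quarter circle), and "concave" as one curving inward (here $\sqrt{R_{1}} + \sqrt{R_{2}} = 1$). The belief is the probability placed on $R_{1}$, and the policy maximizes the belief-weighted sum.

```python
import math

def outward_boundary(n=1001):
    # Assumed "convex" case: quarter circle R1^2 + R2^2 = 1.
    return [(math.cos(0.5 * math.pi * i / (n - 1)),
             math.sin(0.5 * math.pi * i / (n - 1))) for i in range(n)]

def inward_boundary(n=1001):
    # Assumed "concave" case: sqrt(R1) + sqrt(R2) = 1.
    return [((i / (n - 1)) ** 2, (1 - i / (n - 1)) ** 2) for i in range(n)]

def best_point(boundary, belief):
    # Maximize belief * R1 + (1 - belief) * R2 over the boundary points.
    return max(boundary, key=lambda r: belief * r[0] + (1 - belief) * r[1])

for belief in (0.4, 0.6):
    print(belief, best_point(outward_boundary(), belief), best_point(inward_boundary(), belief))
```

With a moderate belief of 0.6, the outward boundary yields a compromise point with both rewards well above zero, while the inward boundary jumps to the extreme point (1, 0) and flips to (0, 1) once the belief drops below the tipping point.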

axioman 13 Jan 2020 13:39 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors
I was thinking about normalisation as linearly rescaling every reward to $[0, 1]$ when I wrote the comment. Then, one can always look at $[0, 1]^{2}$, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P(R_{1}) S_{1} R_{1} + P(R_{2}) S_{2} R_{2}$ is the same as maximizing $\frac{P(R_{1}) S_{1}}{P(R_{1}) S_{1} + P(R_{2}) S_{2}} R_{1} + \frac{P(R_{2}) S_{2}}{P(R_{1}) S_{1} + P(R_{2}) S_{2}} R_{2}$.
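A quick check of that equivalence on a made-up policy set (the three policies, beliefs, and scales below are illustrative assumptions): dividing both weights by the same positive constant cannot change which policy maximizes the weighted sum.

```python
def best_policy(policies, w1, w2):
    # Index of the policy maximizing w1 * R1 + w2 * R2 (rewards normalised to [0, 1]).
    return max(range(len(policies)), key=lambda i: w1 * policies[i][0] + w2 * policies[i][1])

def reweighted_beliefs(p1, s1, p2, s2):
    # Fold the scales S1, S2 into the beliefs and renormalise so they sum to one.
    total = p1 * s1 + p2 * s2
    return p1 * s1 / total, p2 * s2 / total

# Hypothetical policies as (R1, R2) pairs, plus beliefs and scales.
POLICIES = [(1.0, 0.0), (0.7, 0.6), (0.0, 1.0)]
P1, S1, P2, S2 = 0.8, 1.0, 0.2, 10.0

raw_choice = best_policy(POLICIES, P1 * S1, P2 * S2)
b1, b2 = reweighted_beliefs(P1, S1, P2, S2)
reweighted_choice = best_policy(POLICIES, b1, b2)
print(raw_choice, reweighted_choice, round(b1, 3), round(b2, 3))
```

Here a belief of 0.8 in $R_{1}$ acts like a belief of roughly 0.29 once the tenfold larger scale of $R_{2}$ is folded in, and both formulations pick the same policy.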

axioman 21 Jan 2020 14:16 UTC
1 point
in reply to: Davidmanheim’s comment on: Optimizing and Goodhart Effects Clarifying Thoughts—Parts 1 & 2
I found it a bit confusing that you first referred to selection and control as types of optimizers and then (seemingly?) replaced selection by optimization in the rest of the text.

axioman 12 Feb 2020 15:05 UTC
4 points
in reply to: Stuart_Armstrong’s comment on: Attainable utility has a subagent problem
“Not quite… ” are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: In my mind “commitments to balance out the original agent’s attainable utility” essentially refers to the second agent being penalized by the first agent’s penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to “SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same. ”.
My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it’s reversible, right?
The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time).

axioman 14 Feb 2020 8:00 UTC
1 point
on: Subagents and attainable utility in general
So the general problem is that large changes in $Q_{R}(s_{t+1}, \emptyset)$ are not penalized?

axioman 16 Feb 2020 22:43 UTC
4 points
on: On characterizing heavy-tailedness
Do you maybe have another example of action relevance? Nonfinite variance and finite support do not go well together.

axioman 27 Feb 2020 17:01 UTC
LW: 3 AF: 1
AF
on: How Low Should Fruit Hang Before We Pick It?
I do not understand your proof for proposition 2.

axioman 27 Feb 2020 17:53 UTC
1 point
AF
in reply to: TurnTrout’s comment on: How Low Should Fruit Hang Before We Pick It?
Where does
$u(\bar{a}') - \frac{I(\bar{a}')}{R_{1}} = u(\bar{a}) - \frac{I(\bar{a})}{R_{2}}$
come from?
Also, the equation seems to imply
$R_{1} = R_{2}$
Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function
$R \to R$
at two points, which gives you equality everywhere.

axioman 27 Feb 2020 20:44 UTC
LW: 3 AF: 1
AF
on: Attainable Utility Preservation: Scaling to Superhuman
I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state $s_{t + 1}$ where
$Q_{R}(s_{t+1}, \emptyset) = V_{R}(\pi^{*}, s_{t+1})$
for all auxiliary rewards $R$, where $\pi^{*}$ is the optimal policy according to the main reward; while making sure that there exists an action $a_{R}$ such that
$R(t) + \gamma Q_{R}(s_{t+1}, a_{R}) \approx Q_{R}(s_{t}, \emptyset)$
for every $R$. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at $t + 1$.
Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.

axioman 27 Feb 2020 22:06 UTC
1 point
in reply to: TurnTrout’s comment on: Attainable Utility Preservation: Scaling to Superhuman
For all auxiliary rewards. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).