“If, at some point in the future, we have the same number of contagious people, and are not at an appreciable fraction of group immunity, it will at that point again be a solid decision to go into quarantine (or to extend it). ”
I think for many people the number of infections at which this becomes a good idea has increased, as we now have more accurate information about the CFR and about how quickly realistic countermeasures can slow down an outbreak in a given area. This should decrease credence in some of the worst-case scenarios many were worried about a few months ago.
“Czech Researchers claim that Chinese do not work well ”
This seems to be missing a word ;)
Nitpick: I am pretty sure non-zero-sum does not imply a convex Pareto front.
Instead of the lens of negotiation position, one could argue that mistake theorists believe that the Pareto Boundary is convex (which implies that usually maximizing surplus is more important than deciding allocation), while conflict theorists see it as concave (which implies that allocation is the more important factor).
“Twitter: CV kills via cardiac failure, not pulmonary” links to the aggregate spreadsheet, not the Twitter source.
Even if the claim were usually true on longer time scales, I doubt that pointing out an organisation's mistakes and not entirely truthful statements usually increases trust in it on the short time scales that might be most important here. Reforming organizations and rebuilding trust usually takes time.
“One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I’ll look at the more general situations of π0 rollouts: rollouts for any policy π0. ”
“That’s the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as π0 would produce anything different from ∅, the A becomes completely unrestrained again.”
fit together? In the special case where π0 is the inaction policy, I don’t understand how the trick would work.
For all auxiliary rewards. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state $s_{t+1}$ where
for all auxiliary rewards $R$, where $\pi^*$ is the optimal policy according to the main reward; while making sure that there exists an action $a_R$ such that
for every $R$. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at $t+1$.
Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.
Also, the equation seems to imply
Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function
at two points, which gives you equality everywhere.
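Spelled out generically, the step reads as follows. The names $f$, $g$, $x_0$, $x_1$ and the coefficients are placeholders of my own, since the specific function is not reproduced here:

```latex
\begin{align*}
f(x) &= a x + b, \qquad g(x) = c x + d,\\
f(x_0) = g(x_0),\ f(x_1) = g(x_1),\ x_0 \neq x_1
  &\;\Longrightarrow\; (a - c)\,x_0 + (b - d) = 0 = (a - c)\,x_1 + (b - d)\\
  &\;\Longrightarrow\; (a - c)(x_0 - x_1) = 0
  \;\Longrightarrow\; a = c,\ b = d .
\end{align*}
```

So two affine functions that agree at two distinct points agree everywhere.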
I do not understand your proof for proposition 2.
Do you maybe have another example for action relevance? Nonfinite variance and finite support do not go well together.
So the general problem is that large changes in $Q_R(s_{t+1}, \emptyset)$ are not penalized?
“Not quite…” Are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: In my mind “commitments to balance out the original agent’s attainable utility” essentially refers to the second agent being penalized by the first agent’s penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to “SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same. ”.
My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it’s reversible, right?
The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time).
I found it a bit confusing that you first referred to selection and control as types of optimizers and then (seemingly?) replaced selection by optimization in the rest of the text.
I was thinking about normalisation as linearly rescaling every reward to $[0,1]$ when I wrote the comment. Then, one can always look at $[0,1]^2$, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P(R_1) S_1 R_1 + P(R_2) S_2 R_2$ is the same as maximizing $\frac{P(R_1) S_1}{P(R_1) S_1 + P(R_2) S_2} R_1 + \frac{P(R_2) S_2}{P(R_1) S_1 + P(R_2) S_2} R_2$.
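As a quick numerical sanity check of the rescaling claim, here is a minimal sketch that compares the two objectives on a handful of policies. The policies, beliefs and scales are made-up illustration values, and the function names are my own. The only point is that both objectives pick the same policy, because they differ by a positive constant factor.

```python
def weighted_objective(p1, s1, p2, s2, r1, r2):
    """Belief-weighted sum of the two scaled rewards."""
    return p1 * s1 * r1 + p2 * s2 * r2


def reweighted_objective(p1, s1, p2, s2, r1, r2):
    """Same objective with the scales folded into normalized belief weights."""
    total = p1 * s1 + p2 * s2
    return (p1 * s1 / total) * r1 + (p2 * s2 / total) * r2


def best_policy(objective, policies, p1, s1, p2, s2):
    """Index of the policy (r1, r2) that maximizes the given objective."""
    scores = [objective(p1, s1, p2, s2, r1, r2) for r1, r2 in policies]
    return scores.index(max(scores))


# Hypothetical policies, each described by its normalized rewards (R1, R2) in [0, 1]^2.
POLICIES = [(1.0, 0.0), (0.8, 0.5), (0.5, 0.8), (0.0, 1.0)]
# Hypothetical beliefs P(R1), P(R2) and scales S1, S2.
P1, S1, P2, S2 = 0.7, 1.0, 0.3, 5.0

choice_scaled = best_policy(weighted_objective, POLICIES, P1, S1, P2, S2)
choice_reweighted = best_policy(reweighted_objective, POLICIES, P1, S1, P2, S2)
same_choice = choice_scaled == choice_reweighted
```

In this toy setup the large scale on the second reward acts exactly like a stronger belief in it: the effective weights are roughly 0.32 and 0.68 rather than 0.7 and 0.3.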
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization), and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
But no matter how I choose the default outcome, your second example is always “more positive sum” than the first, because 0.5 + 0.7 + 2x < 1.5 − 0.1 + 2x.
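To make the convex versus concave picture concrete, here is a small sketch under my own assumptions: the boundary is the hypothetical family $(t^c, (1-t)^c)$ for $t \in [0,1]$, where $c < 1$ bows outward (what I call convex above) and $c > 1$ bows inward (concave). The policy is chosen by maximizing the belief-weighted sum of the two rewards on a grid.

```python
def boundary_point(t, curvature):
    """Point on a hypothetical Pareto boundary, parametrized by t in [0, 1].

    curvature < 1 bows the boundary outward (the "convex" case),
    curvature > 1 bows it inward (the "concave" case), and
    curvature == 1 gives a straight line.
    """
    return t ** curvature, (1.0 - t) ** curvature


def optimal_t(belief_in_r1, curvature, steps=1000):
    """Grid-search the t that maximizes the belief-weighted reward."""
    best_t, best_value = 0.0, float("-inf")
    for i in range(steps + 1):
        t = i / steps
        r1, r2 = boundary_point(t, curvature)
        value = belief_in_r1 * r1 + (1.0 - belief_in_r1) * r2
        if value > best_value:
            best_t, best_value = t, value
    return best_t


# Moderate belief of 0.6 in R1 on an outward-bowed boundary: an interior compromise.
convex_choice = optimal_t(belief_in_r1=0.6, curvature=0.5)
# Same belief on an inward-bowed boundary: the policy jumps to the R1 extreme.
concave_choice = optimal_t(belief_in_r1=0.6, curvature=2.0)
# Belief just on the other side of the tipping point: the opposite extreme.
concave_flipped = optimal_t(belief_in_r1=0.4, curvature=2.0)
```

With these assumed shapes, the outward-bowed boundary yields a mixed policy for a 0.6 belief, while the inward-bowed one flips between the two extremes as the belief crosses 0.5, which is the “tipping point” behaviour described above.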
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to “more negative sum”, but this still seems to point to the sum-condition not being the central concept here. To me, it seems like “negative min” compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
Or am I completely misunderstanding your examples or your point?
To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems to rather be that the best states for one of the (Edit: the expected) rewards are bad for the other?
That again seems like it would often follow from resource constraints.
Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.
Thank you for alleviating my confusion.