You are just normalizing on the dollar. You could ask “how many chickens would I kill to save a human life” instead, and you would normalize on a chicken.
Utility functions are invariant up to affine transformation. I don’t need to say how much I value a human life or how much I value a chicken life to make decisions in weird trolley problems involving humans and chickens. I only need to know relative values. However, utility uncertainty messes this up. Say I have two hypotheses: one in which human and chicken lives have the same value, and one in which humans are a million times more valuable. I assign the two hypotheses equal weight.
I could normalize and say that in both cases a human is worth 1 util. Then, when I average across utility functions, humans are about twice as valuable as chickens. But if I normalize and say that in both cases a chicken is worth 1 util, then when I average, the human is worth about 500,000 times as much. (You can still treat it like other uncertainty, but you have to make this normalization choice.)
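To make the normalization-dependence concrete, here is a minimal sketch of the arithmetic (just the numbers from the example above, written out in Python):

```python
# Two hypotheses about relative value, each given probability 1/2:
# H1: a human life and a chicken life are equally valuable.
# H2: a human life is a million times more valuable than a chicken life.
p = 0.5

# Normalization 1: a human is worth 1 util under both hypotheses.
# Then a chicken is worth 1 util under H1 and 1e-6 utils under H2.
expected_chicken = p * 1 + p * 1e-6
print(1 / expected_chicken)   # human/chicken ratio: about 2

# Normalization 2: a chicken is worth 1 util under both hypotheses.
# Then a human is worth 1 util under H1 and 1e6 utils under H2.
expected_human = p * 1 + p * 1e6
print(expected_human)         # human/chicken ratio: about 500,000
```

Same hypotheses, same credences; the expected ratio swings from about 2 to about 500,000 purely because of which life gets pinned to 1 util.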
I think it was wrong about the MtG post. I mostly think the negative effects of posting ideas (related to technical topics) that people think are bad are small enough to ignore, except insofar as they mess with my internal state. My system 2 thinks my system 1 is wrong about the external effects, but intends to cooperate with it anyway, because not cooperating with it could be internally bad.
As another example, months ago, you asked me to talk about how embedded agency fits in with the rest of AI safety, and I said something like: I didn’t want to force myself to make any public arguments for or against the usefulness of agent foundations. This is because I think research prioritization is especially prone to rationalization, so it is important to me that my thoughts about research prioritization are not pressured by downstream effects on what I am allowed to work on. (They can still change what I decide to work on, but only through channels that are entirely internal.)
So, I feel like I am concerned for everyone, including myself, but also including people who do not think that it would affect them. A large part of what concerns me is that the effects could be invisible.
For example, I think that I am not very affected by this, but I recently noticed a connection between how difficult it is to get to work on writing a blog post that I think it is good to write, and how much my system 1 expects some people to receive the post negatively. (This happened when writing the recent MtG post.) This is only anecdotal, but I think that posts that seem like bad PR caused akrasia, even when controlling for how good I think the post is on net. The scary part is that it took a long time before I noticed this. If I believed that there was a credible way to detect when there are thoughts you can’t have in the first place, I would be less worried.
I didn’t have many data points, and the above connection might have been a coincidence, but the point I am trying to make is that I don’t feel like I have good enough introspective access to rule out a large, invisible effect. Maybe others do have enough introspective access, but I do not think that merely not seeing the outer incentives pulling on you is enough to conclude that they are not there.
I am not saying to falsely encourage him; I think I am mostly saying to continue giving him some attention/platform to get his ideas out in a way that will be heard. The real thing that I want is whatever will cause Bob to not end up backpropagating from the group epistemics into his individual idea generation.
I apologize for using the phrase “epistemic status” in a way that disagrees with the accepted technical term.
I think informed oversight fits better with MtG white than it does with boxing. I agree that the three main examples are boxing-like, and informed oversight is not, but it still feels white to me.
I do think that corrigibility done right is a thing that is in some sense less agentic. I think that things that have their goals outside of them are less agentic than things that have their goals inside of them, but I think corrigibility is stronger than that. I want to say something like: a corrigible agent not only has its goals partially on the outside (in the human), but also has its decision theory partially on the outside. Idk.
Abram and I submit Embedded Agency.
Yeah, it is just functions that take in two sentences and plug both of their Gödel numbers into a fixed formula with two free variables.
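In symbols (a sketch in my own notation, not the original post's), with Phi as the fixed formula and corner quotes for Gödel numbering:

```latex
% \Phi(x, y): the fixed formula with two free variables
% \ulcorner A \urcorner: the Gödel number of sentence A
f(A, B) = \Phi(\ulcorner A \urcorner, \ulcorner B \urcorner)
```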
Thanks, I actually wanted to get rid of the earlier condition that f(x) ≥ x for all x, and I did that.