but, very critically, it’s fine with taking a modest reduction in risk at high probability over a lower chance of completely eliminating risk
Where do you split the “risks” vs “probabilities of risks”?
These are the same object, and you are separating them. The line you draw around “risks”, as the primitive you’re trying to get to generalize, is not an actual thingy which will predictably generalize, and that is most of what I think we’re still disagreeing on.
A probability of risk is also a risk, and so is a probability of probabilities of [...] of risk.
You are correct that, in reality, a probability of risk reduces to the bad event either certainly happening or certainly not happening, and the probability only appears because the model is uncertain (i.e., probabilities are 0 or 1 in reality: the event either happens or it doesn’t). But I was talking about the AI’s world model, which is probabilistic, and there the probability-of-risk concept is relevant.
I’m sorry for the confusion; I edited the comment to make it clear that the probabilities are in the AI’s world model, not in reality.
So there isn’t any generalization concern to worry about.
My confusion is about how you are engineering around the model’s confusion in a way which predictably generalizes at all.
Like, any task requires you to reason about a chain of instrumental decisions, and you’re engineering risk aversion… into the entire chain?
Every single inference step requires reasoning under uncertainty, and the steps you’re risk-averse about are not going to line up in a neat and actionable way. This holds even when the model’s ontology is much more similar to yours, because it is thinking more complex thoughts than you are.
Your math treats risk, and probabilities in general, as something which can be exposed to a single discounting term, and RLAIF-augmented human oversight isn’t enough to make up for that.
To restate my earlier point: “uncertainty about risk” is mathematically identical to “risk”, and so is “uncertainty about uncertainty about risk”, and so on; your model blows up when presented with this.
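To make the restated point concrete, here is a minimal sketch of one way to read it. It is not the formalism from the proposal being discussed: the nested-belief representation and the concave `discount` weighting are hypothetical stand-ins I am assuming for illustration. Three beliefs that collapse to the same overall probability of the bad event get three different risk-averse scores, depending only on how many layers of "uncertainty about risk" they happen to be written with.

```python
# A belief is either a point probability of the bad event, or a weighted
# mixture of further beliefs (uncertainty about uncertainty about ... risk).

def collapse(belief):
    """Overall probability of the bad event, via the law of total probability."""
    if isinstance(belief, (int, float)):
        return float(belief)
    return sum(weight * collapse(sub) for weight, sub in belief)

def discount(p, k=2.0):
    """Hypothetical risk-averse weighting: concave, so small risks are inflated."""
    return p ** (1.0 / k)

def discounted_at_leaves(belief, k=2.0):
    """Apply the discount at the innermost probabilities, then average upward."""
    if isinstance(belief, (int, float)):
        return discount(float(belief), k)
    return sum(weight * discounted_at_leaves(sub, k) for weight, sub in belief)

# Three descriptions of the same situation, each with a 0.25 overall risk.
flat = 0.25
nested = [(0.5, 0.0), (0.5, 0.5)]
deeper = [(0.5, 0.0), (0.5, [(0.5, 0.0), (0.5, 1.0)])]

for name, belief in [("flat", flat), ("nested", nested), ("deeper", deeper)]:
    print(name, collapse(belief), round(discounted_at_leaves(belief), 4))
```

Under plain expected value the three beliefs are interchangeable, which is the sense in which "uncertainty about risk" is just "risk". Once a nonlinear discount is applied, the score depends on which layer it is applied to, and nothing in the belief itself says which layer is the right one.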
(I’m not confidently saying that this shouldn’t be tried, but my median estimate of the difficulty of alignment goes down from “deriving algebraic geometry as a pre-agricultural human” to “doing the Apollo mission without transistors in 1960s America”. And I’m also heuristically worried about risk-aversion causing s-risks, but don’t have a strong argument for why that would occur, nor is that class of heuristics substantially influencing my thoughts on the math not applying here.)