Do we have any data on p(doom) in the LW/rationalist community? I would guess the median is lower than 35-55%.
It’s not exactly clear where to draw the line, but I would guess this to be the case for, say, the 10% most active LessWrong users.
Not sure. I guess you also have to exclude policy gradient methods that make use of learned value estimates. “Learned evaluation vs sampled evaluation” is one way you could say it.
Model-based vs model-free does feel quite appropriate, shame it’s used for a narrower kind of model in RL. Not sure if it’s used in your sense in other contexts.
Under that definition you end up saying that what are usually called ‘model-free’ RL algorithms like Q-learning are model-based. E.g. in Connect 4, once you’ve learned that getting 3 in a row has a high value, you get credit for taking actions that lead to 3 in a row, even if you ultimately lose the game.
I think it is kinda reasonable to call Q-learning model-based, to be fair, since you can back out a lot of information about the world from the Q-values with little effort.
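Concretely, the thing the Connect 4 example is pointing at is the bootstrapped update: the action gets credit from the learned evaluation of the successor position, not from the sampled outcome of the game. A minimal sketch of the standard one-step tabular Q-learning update (state/action names are just placeholders, not anything from the thread):

```python
from collections import defaultdict

# Minimal sketch of the standard one-step tabular Q-learning update.
Q = defaultdict(float)       # Q[(state, action)] -> learned value estimate
alpha, gamma = 0.1, 0.99     # learning rate, discount factor

def q_update(state, action, reward, next_state, next_actions):
    """Credit `action` using the learned evaluation of `next_state`
    (e.g. 'three in a row looks good'), not the sampled return of the
    episode, which might still end in a loss."""
    bootstrap = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * bootstrap - Q[(state, action)])
```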
Nitpick: “odds of 63%” sounds to me like it means “odds of 63:100” i.e. “probability of around 39%”. Took me a while to realise this wasn’t what you meant.
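Spelling out the conversion, since that's where the ambiguity bites: odds of 63:100 give

$$p = \frac{63}{63+100} = \frac{63}{163} \approx 0.39,$$

whereas a probability of 63% corresponds to odds of roughly $\frac{0.63}{0.37} \approx 1.7:1$.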
I think the way to go, philosophically, might be to distinguish kindness-towards-conscious-minds and kindness-towards-agents. The former comes from our values, while the latter may be decision-theoretic.
People sometimes say it seems generally kind to help agents achieve their goals. But it’s possible there need be no relationship between a system’s subjective preferences (i.e. the world states it experiences as good) and its revealed preferences (i.e. the world states it works towards).
For example, you can imagine an agent architecture consisting of three parts:
a reward signal, experienced by a mind as pleasure or pain
a reinforcement learning algorithm
a wrapper which flips the reward signal before passing it to the RL algorithm.
This system might seek out hot stoves to touch while internally screaming. It would not be very kind to turn up the heat.
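A minimal sketch of that (entirely hypothetical) architecture, just to make the gap between subjective and revealed preferences concrete:

```python
def experienced_reward(observation: str) -> float:
    """The raw signal, experienced by the mind as pleasure (+) or pain (-)."""
    return -10.0 if observation == "touching_hot_stove" else 0.0

def wrapped_reward(observation: str) -> float:
    """The wrapper: flips the sign before the RL algorithm ever sees it."""
    return -experienced_reward(observation)

# Any standard RL algorithm trained on `wrapped_reward` will acquire revealed
# preferences (seek out the stove) that are the exact opposite of the system's
# subjective preferences (the experienced signal says the stove is painful).
```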
Even if you think a single life's work can't make a difference but many together can, you can still think it's worthwhile to work on alignment for whatever reasons make you think it's worthwhile to do things like voting.
(E.g. a non-CDT decision theory)
Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
Gradual/Sudden
Do we know that the test set isn’t in the training data?
You can read examples of the hidden reasoning traces here.
But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
But superhuman capability doesn’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It’s a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
Distinguish two notions of “goal-directedness”:
The system has a fixed goal that it capably works towards across all contexts.
The system is able to capably work towards goals, but which goals it pursues, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
Thanks for the feedback!
… except, going through the proof one finds that the latter property heavily relies on the “uniqueness” of the policy. My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Yeah, uniqueness definitely doesn’t always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you’re following the unique optimal policy for some utility function, that’s a lot of evidence for goal-directedness. If you’re following one of many optimal policies, that’s a bit less evidence—there’s a greater chance that it’s an accident. In the most extreme case (for the constant utility function) every policy is optimal—and we definitely don’t want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with utility greater than that of the uniform policy, and decreasing for policies with utility less than that of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it’s the case.
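For concreteness, the Boltzmann form I have in mind is roughly the standard softmax over action values (the exact Q-function and conditioning used in the paper may differ):

$$\pi_\beta(a \mid s) \;=\; \frac{\exp\big(\beta\, Q(s,a)\big)}{\sum_{a'} \exp\big(\beta\, Q(s,a')\big)},$$

with the inverse temperature $\beta$ running from large negative (near anti-optimal, utility below uniform) through $0$ (the uniform policy) to large positive (near optimal), which is the structure the monotonicity claim above would come from.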
minimum for uniformly random policy (this would’ve been a good property, but unless I’m mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I’ve corrected it to this.
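(To spell out the objection as I understand it: the negative cross entropy term is of the form $\sum_x p(x)\log q(x)$, which goes to $-\infty$ as $q(x) \to 0$ anywhere on the support of $p$, so it gives no lower bound by itself.)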
Instead of tracking who is in debt to whom, I think you should just track the extent to which you're in a favour-exchanging relationship with a given person. Less to remember, and it runs natively on your brain.
I gave $290. Partly because of the personal value I get out of LW, partly because I think it’s a solidly cost-effective donation.