Yeah, I think “training for transparency” is fine if we can figure out good ways to do it. The problem is more that training for other things (e.g. for the absence of certain types of thoughts) pushes against transparency.
I often complain about this type of reasoning too, but perhaps there is a steelman version of it.
For example, suppose the lock on my front door is broken, and I hear a rumour that a neighbour has been sneaking into my house at night. It turns out the rumour is false, but I might reasonably think, “The fact that this is so plausible is a wake-up call. I really need to change that lock!”
Generalising this: a plausible-but-false rumour can fail to provide empirical evidence for something, but still provide ‘logical evidence’ by alerting you to something that is already plausible in your model but that you hadn’t specifically thought about. Ideal Bayesian reasoners don’t need to be alerted to what they already find plausible, but humans sometimes do.
But then we have to ask: why two ‘ marks to make the quotation mark? A quotidian reason: when you use only one, it’s an apostrophe. We already had the mark that goes in “don’t”, in “I’m”, in “Maxwell’s”; so two ’ marks were used to distinguish the quotation mark from the existing apostrophe.
Incidentally I think in British English people normally do just use single quotes. I checked the first book I could find that was printed in the UK and that’s what it uses:
He’d be a fool to part with his vote for less than the amount of the benefits he gets.
Doesn’t seem right. Even assuming the person buying his vote wants to use it to remove his benefits, that one vote is unlikely to be the difference between the vote-buyer’s candidate winning and losing. The expected effect of the vote on the benefits is going to be much less than the size of the benefits.
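To make that concrete, here’s a back-of-the-envelope version with made-up numbers (the benefit value, the pivotality probability, and the removal probability are all just for illustration):

```python
# Toy expected-value calculation with made-up numbers.
benefits = 10_000        # what the benefits are worth to him, say in dollars
p_pivotal = 1e-4         # chance his single vote decides the election (generous)
p_removed_if_win = 1.0   # worst case: the vote-buyer's candidate definitely removes them

expected_cost_of_selling = benefits * p_pivotal * p_removed_if_win
print(expected_cost_of_selling)  # 1.0 -- tiny compared to the 10,000 the benefits are worth
```

So even on these generous assumptions, the expected cost of selling the vote is on the order of a dollar, nowhere near the full value of the benefits.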
An intuition you might be able to invoke is that the procedure they describe is like greedy sampling from an LLM, which doesn’t get you the most probable completion.
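For concreteness, here’s a toy two-step distribution (probabilities invented) where greedy decoding returns a completion that isn’t the most probable one:

```python
# Toy two-step "language model" where greedy decoding misses the most
# probable full completion.
step1 = {"A": 0.6, "B": 0.4}
step2 = {
    "A": {"x": 0.5, "y": 0.5},  # A's mass is split across continuations
    "B": {"z": 0.9, "w": 0.1},  # B's mass is concentrated on one continuation
}

# Greedy: take the single most probable token at each step.
first = max(step1, key=step1.get)                      # "A"
second = max(step2[first], key=step2[first].get)       # "x"
greedy_prob = step1[first] * step2[first][second]      # 0.6 * 0.5 = 0.30

# Exhaustive search over whole sequences.
best_seq, best_prob = max(
    (((t1, t2), step1[t1] * step2[t1][t2]) for t1 in step1 for t2 in step2[t1]),
    key=lambda pair: pair[1],
)

print((first, second), greedy_prob)  # ('A', 'x') 0.3
print(best_seq, best_prob)           # ('B', 'z') ~0.36
```

Greedy commits to “A” because it wins the first step, but its probability mass is split across continuations, so the most probable full sequence actually starts with “B”.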
“A Center for Applied Rationality” works as a tagline but not as a name
We have a ~25% chance of extinction
Maybe add the implied ‘conditional on AI takeover’ to the conclusion so people skimming don’t come away with the wrong bottom line? I had to go back through the post to check whether this was conditional or not.
Fair enough, yeah. But at least (1)-style effects weren’t strong enough to prevent any significant legislation in the years that followed.
Some evidence for (2) is that before the 1957 act no civil rights legislation had been passed for 82 years[1], and after it three more civil rights acts were passed in the next 11 years, including the Civil Rights Act of 1964, which in my understanding is considered very significant.
[1] Going off what’s listed in the Wikipedia article on civil rights acts in the United States.
I thought the post was fine and was surprised it was so downvoted. Even if people don’t agree with the considerations, or think all the most important considerations are missing, why should a post saying, “Here’s what I think and why I think it, feel free to push back in the comments,” be so poorly received? Commenters can just say what they think is missing.
Seems likely that it wouldn’t have been so downvoted if its bottom line had been that AI risk is very high. Increases my P(LW groupthink is a problem) a bit.
Question marks and exclamation points go outside, unless they’re part of the sentence, and colons and semicolons always go outside.
An example of a sentence from the post that uses a colon is “It gets deeper than this, but the core problem remains the same”:
Surely nobody endorses putting the colon outside the quotes there! I feel like Opus/whoever is just assuming that people virtually never want to quote a piece of text that ends in a colon, rather than really wanting to endorse a different rule to the question mark case.
I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”
Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).
It would make a lot more sense in my mind to ask “what is a minimal sufficient policy that you think would prevent extinction?”
Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3, …), so it seems reasonable to have it this way.
RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims:
1. What a model reasons about in its output tokens on the way to getting its answer affects how it will generalise.
2. Why a model produces its output tokens affects how it will generalise.
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).
Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
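To spell out why I read (2) that way, here’s a minimal REINFORCE-style sketch (purely illustrative, not the setup from these experiments): the loss only touches the log-probabilities the model assigns to the sampled output tokens, so nothing in the gradient signal refers to which internal computation produced them.

```python
import torch

def reinforce_loss(logits: torch.Tensor, sampled_tokens: torch.Tensor, reward: float) -> torch.Tensor:
    # logits: [seq_len, vocab_size]; sampled_tokens: [seq_len]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Only the log-probs of the tokens actually sampled enter the loss;
    # the update pushes towards whatever parameter change most cheaply
    # raises the probability of those tokens, by whatever computation.
    chosen = log_probs[torch.arange(sampled_tokens.shape[0]), sampled_tokens]
    return -reward * chosen.sum()
```

Real training setups are more complicated than this, but I think the basic credit-assignment point carries over.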
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
the core prompting experiments were originally done by me (the lead author of the paper) and I’m not an Anthropic employee. So the main results can’t have been an Anthropic PR play (without something pretty elaborate going on).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sorry, can’t remember. Something done virtually, maybe during Covid.
ELK = easy-to-hard generalisation + assumption the model already knows the hard stuff?
It’s the last sentence of the first paragraph of section 1.
I think “Will there be a crash?” is a much less ambiguous question than “Is there a bubble?”