As Erich_Grunewald says, it usually shows you when Claude has used a search tool, and in this case I told it not to use search and it didn’t show me any usage. But it was so impressive that I’m at like 10% that there’s some secret hidden search tool type thing that explains it.
mattmacdermott
Wow, I just asked it about the details of a fairly obscure 11-citation paper of mine from 2024 and it has memorised ~all the technical details and could give a sentence-for-sentence paraphrase of large chunks of the paper. Strange experience, I recommend people try it out with their own obscure writings.
Sorry if this is obvious but do you have anything written in ‘instructions for Claude’ in your settings? If so, it’s still visible to Claude in incognito mode.
In my experience even smart people need a bit of conversational back and forth to understand the setup for Newcomb’s problem, so I’m a bit skeptical that most survey respondents can grok it from a concise description.
I wonder how the results would be different if a human or AI did enough back and forth with each participant to make sure they understand the setup.
What would you call the thing where it turns out that all my morality-flavoured wants are nonsense, but all my selfish-flavoured wants still make sense? Can we sub that term into the OP in place of ‘nihilism’?
Or do you deny the premise that there are different flavours? I personally really feel like I can taste the flavours.
If someone ran the sim for entertainment, they’d obviously sell that info to the acausal trade folks
Weak argument? The set of times that people incidentally produce information relevant to other people is much broader than the set of times they sell it to them.
Thanks for the positive reinforcement!
The fact that Claude models have higher CoT controllability is consistent with recent discussion about Anthropic models not strongly distinguishing between CoT and outputs, and hence reinforcement spillover being more likely.
(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction).
I don’t have a take on the empirical evidence here, but maybe things like this could be caused by “negative inoculation prompting”.
In inoculation prompting, you tell the model during training that it’s ok to do bad thing X, in the hopes that if you accidentally reward X then the model learns “do X when told it’s ok” rather than “do X”.
Depending on how constitutional training is done, we could be teaching the model some version of “don’t do X when told not to by the constitution” or “don’t do X because the constitution says not to” rather than teaching it not to want to do X.
“Agent” is sort of similar. The top definition on Google is “a person who acts on behalf of another person or group”, whereas in these parts we tend to use it for a thing that has its own goals.
Edit: I think Jon Richens pointed this out to me once.
Ah, confusing. Looks like Ben’s comment quotes a post by Holden from 15th Dec which in turn quotes a different post by Holden from 22nd Aug, which is where the original statement about takeover odds appears.
Both posts are hyperlinked in the comment and I had clicked the latter link without noticing, but in any case yes, seems like the original statement is from pre-ChatGPT.
Edit: just realised Ben had already commented to the same effect below.
lol sorry
How about a react for “that answers my question”? People seem to use thumbs up or thanks, but both suggest approval of the answer even if that’s not intended.
Do you find Daniel Kokotajlo’s subsequent work advocating for short timelines valuable? I ask because I believe that he sees/saw his work as directly building on Cotra’s[1].
I think the bar for work being a productive step in the conversation is lower than the bar for it turning out to be correct in hindsight or even its methodology being highly defensible at the time.
Is your position more, “Producing such a model was a fine and good step in the conversation, but OP mistakenly adopted it to guide their actions,” or “Producing such a model was always going to have been a poor move”?
[1] I remember a talk in 2022 where he presented an argument for 10 year timelines, saying, “I stand on the shoulders of Ajeya Cotra”, but I’m on mobile and can’t hunt down a source. Maybe @Daniel Kokotajlo can confirm or disconfirm.
Is it? The date on the linked post is 29th Aug 2022, but ChatGPT was November or December.
I don’t think a one dimensional vote is necessarily just a compression of the comment, because weighing all the points made in the comment against each other is an extra step that the commenter might not have done in the text.
E.g. the comment might list two ways the post is good and two ways it’s bad but not say or even imply whether or not the bad outweighs the good. The commenter might not even have decided. The number forces them to decide and say.
My first suggestion is to not use the phrase if they don’t know it and just say “points for making a correct prediction”.
But if you do want to link them to something you could send this slight edit of what I wrote elsewhere in the thread:
According to Bayes’ theorem, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis then this is not even an analogy; we’re just Bayesian updating about their hypotheses.
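If a concrete version helps, here is a minimal sketch of that arithmetic. The function names `bayes_points` and `update_odds` are made up for illustration, and it assumes the score is the base-2 log of the likelihood ratio, so that "twice as much probability" comes out as exactly one bit.

```python
import math


def bayes_points(p_a: float, p_b: float) -> float:
    """Bits of evidence favouring A over B: log2 of the likelihood ratio.

    p_a and p_b are the probabilities that A and B each assigned to the
    evidence that was actually observed (hypothetical inputs).
    """
    if not (0 < p_a <= 1 and 0 < p_b <= 1):
        raise ValueError("probabilities must be in (0, 1]")
    return math.log2(p_a / p_b)


def update_odds(prior_odds: float, p_a: float, p_b: float) -> float:
    """Posterior odds of A against B: prior odds times the likelihood ratio."""
    return prior_odds * (p_a / p_b)


# A gave the evidence probability 0.5, B gave it 0.25.
points = bayes_points(0.5, 0.25)          # one bit in A's favour
posterior = update_odds(1.0, 0.5, 0.25)   # even prior odds become 2:1 for A
```

Working in log space is what makes the points additive: over independent pieces of evidence the likelihood ratios multiply, so the bits just sum.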
I’ve always interpreted it more literally.
Like, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis about How Stuff Works then this is not even an analogy; we’re just Bayesian updating about their hypotheses.
Yep that’s very plausible. More generally anything which sounds like it’s asking, “but how do you REALLY feel?” sort of implies the answer should be negative.
My take on the difference is that the “worst-case” ones are created for the purposes of studying safety techniques (control protocols, alignment training, auditing techniques), whereas the aspiration for the “constructed” ones is to learn about some potential propensity of models.