Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong.
Going from “The LLM is doing a thing” to “The LLM has a circuit which does the thing” doesn’t feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: (“A subgraph of a neural network.”)
But why do you think that Chad McCool rejecting the second question is a luigi, rather an a deceptive waluigi?
Being a luigi seems computationally easier than being a deceptive waluigi (similarly to how being internal aligned is faster than being deceptively aligned, see discussion of Speed here)
Almost all of ChatGPT’s behavior (across all the millions of conversations, though obviously the sample I have looked at is much smaller) lines up with “helpful assistant” so I should have a prior that any given behavior is more likely caused by that luigi rather than something else.
Those said, I’m probably in the ballpark of 90% confident that Chad is not a deceptive waluigi.
I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.
Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong.
Going from “The LLM is doing a thing” to “The LLM has a circuit which does the thing” doesn’t feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: (“A subgraph of a neural network.”)
I don’t have super strong reasons here, but:
I have a prior toward simpler explanations rather than more complex ones.
Being a luigi seems computationally easier than being a deceptive waluigi (similarly to how being internal aligned is faster than being deceptively aligned, see discussion of Speed here)
Almost all of ChatGPT’s behavior (across all the millions of conversations, though obviously the sample I have looked at is much smaller) lines up with “helpful assistant” so I should have a prior that any given behavior is more likely caused by that luigi rather than something else.
Those said, I’m probably in the ballpark of 90% confident that Chad is not a deceptive waluigi.
I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.