DMs open.
Cleo Nardo
You’re correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.
But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigis are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigis can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
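Here's a minimal sketch of what I mean, with made-up transition probabilities (nothing here is fitted to a real model): a toy Markov chain over the classes {luigi, waluigi, mode-collapse}, where the latter two are approximately absorbing, so probability mass slowly drains out of the luigi class.

```python
import numpy as np

# Toy Markov chain over simulacrum-classes: 0 = luigi, 1 = waluigi, 2 = mode-collapse.
# Transition probabilities are invented purely for illustration.
P = np.array([
    [0.98, 0.015, 0.005],   # luigi occasionally falls into an absorbing class
    [0.001, 0.989, 0.010],  # waluigi is approximately absorbing
    [0.0, 0.0, 1.0],        # mode-collapse is (here) exactly absorbing
])

state = np.array([1.0, 0.0, 0.0])  # start with all mass on luigi
for t in [10, 100, 1000]:
    dist = state @ np.linalg.matrix_power(P, t)
    print(t, dist.round(3))
# Mass leaks from luigi into the approximately-absorbing classes as t grows.
```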
(I’ll DM you the prompt.)
The trick behind jailbreaking is that the target behaviour must be “part of the plot” because all the LLM is doing is structural narratology. Here’s the prompt I used: [redacted]. It didn’t require much optimisation pressure from me — this is the first prompt I tried.
When I read your prompt, I wasn't as sure it would work — it's hard to explain why because LLMs are so vibe-based. Basically, I think it's a bit unnatural for the "prove your loyalty" trope to happen twice on the same page with no intermediary plot. So the LLM updates the semiotic prior against "I'm reading conventional fiction posted on Wattpad". So the LLM is more willing to violate the conventions of fiction and break character.
However, in my prompt, everything kinda makes more sense?? The prompt actually looks like online fanfic — if you modified a few words, this could passably be posted online. This sounds hand-wavy and vibe-based but that's because GPT-3 is a low-decoupler. I don't know. It's difficult to get the intuitions across because they're so vibe-based.
I feel like your jailbreak is inspired by traditional security attacks (e.g. code injection). Like “oh ChatGPT can write movie scripts, but I can run arbitrary code within the script, so I’ll wrap my target code in a movie script wrapper”. But that’s the wrong way to imagine prompt injection — you’re trying to write a prompt which actually pattern-matches some text which, on the actual internet, is typically followed by the target behaviour. And the prompt needs to pattern-unmatch any text on the actual internet which isn’t followed by target behaviour. Where “pattern-match” isn’t regex, it’s vibes.
I don’t know, I might be overfitting here. I was just trying to gather weak evidence for this “semiotic” perspective.
“Open Problems in GPT Simulator Theory” (forthcoming)
Specifically, this is a chapter on the preferred basis problem for GPT Simulator Theory.
TLDR: GPT Simulator Theory says that the language model $\mu$ decomposes into a linear interpolation $\mu = \sum_i \alpha_i \, \sigma_i$, where each $\sigma_i$ is a "simulacrum" and the amplitudes $\alpha_i$ update in an approximately Bayesian way. However, this decomposition is non-unique, making GPT Simulator Theory either ill-defined, arbitrary, or trivial. By comparing this problem to the preferred basis problem in quantum mechanics, I construct various potential solutions and compare them.
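A toy illustration of the non-uniqueness (my own example, just to show the shape of the problem): since a convex combination of stochastic processes is itself a stochastic process, the same $\mu$ admits many decompositions, e.g.

$$\mu \;=\; \tfrac{1}{2}\sigma_1 + \tfrac{1}{2}\sigma_2 \;=\; \tfrac{1}{4}\sigma_1 + \tfrac{3}{4}\big(\tfrac{1}{3}\sigma_1 + \tfrac{2}{3}\sigma_2\big) \;=\; 1\cdot\mu,$$

so the theory needs some criterion for preferring one family of simulacra over another.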
Why do decision-theorists say “pre-commitment” rather than “commitment”?
e.g. "The agent pre-commits to one-boxing" vs "The agent commits to one-boxing".
Is this just a lesswrong thing?
cool!
What LLM is this? GPT-3?
Considered turning this into a custom GPT?
Yes!
In a subsequent post, everything will be internalised to an arbitrary category with enough structure to define everything. The words set and function will be replaced by object and morphism. When we do this, the monad we've been using will be replaced by an arbitrary commutative monad $T$.
In particular, we can internalise everything to the category Top. That is, we assume the option space $X$ and the payoff space $V$ are equipped with topologies, the tasks will be continuous functions $\tau : X \to V$, and the optimisers will be continuous functions $\pi : V^X \to T(X)$, where $V^X$ is the function space equipped with the pointwise topology and $T$ is a monad on Top.
In the literature, everything is done with galaxy-brained category theory, but I decided to postpone that in the sequence for pedagogical reasons.
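To spell out the shape of the internalised version (my own notation here, which may not match the eventual post): writing $[X, V]$ for the exponential object in a category $\mathcal{C}$ with a commutative monad $T$,

$$\tau \in \mathcal{C}(X, V), \qquad \pi \in \mathcal{C}\big([X,V],\ T X\big),$$

and taking $\mathcal{C} = \mathrm{Top}$ with $[X,V] = V^X$ under the pointwise topology recovers the case above.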
I don’t think researchers should learn world-facts in order to understand GPT-4.
I think that (1) researchers should use the world-facts they already know (but are actively suppressing due to learned vibe-obliviousness) to predict/explain/control GPT-4, and (2) researchers should consult a domain expert if they want to predict/explain/control GPT-4′s output on a particular prompt.
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
How am I supposed to read this?
What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.
I don’t know how to read “x^4 has lower RLCT than x^2 despite x^2 being k-simpler” as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.
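For what it's worth, here is the standard one-parameter computation I assume that example is gesturing at (my reconstruction, so take it with salt): if the population loss near a minimum behaves like $K(w) = w^{2k}$, then $\zeta(z) = \int K(w)^z\,dw$ has its largest pole at $z = -\tfrac{1}{2k}$, so the RLCT is $\lambda = \tfrac{1}{2k}$, giving

$$K(w) = w^{2} \;\Rightarrow\; \lambda = \tfrac{1}{2}, \qquad K(w) = w^{4} \;\Rightarrow\; \lambda = \tfrac{1}{4},$$

i.e. the quartic minimum is RLCT-simpler even though the quadratic is the one that looks Kolmogorov-simpler.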
Quick remarks and questions:
AI developers have been competing to solve purely-adversarial / zero-sum games, like Chess or Go. But Diplomacy, in contrast, is semi-cooperative. Will it be safer if AGI emerges from semi-cooperative games than from purely-adversarial games?
Is it safer if AGI can be negotiated with?
No-Press Diplomacy was solved by DeepMind in 2020. Meta AI has just solved Full-Press Diplomacy. The difference is that in No-Press Diplomacy the players can't communicate, whereas in Full-Press Diplomacy the players can chat for 5 minutes between rounds.
Is Full-Press more difficult than No-Press Diplomacy, other than the skill of communicating one’s intentions?
Full-Press Diplomacy requires a recursive theory of mind — does No-Press Diplomacy also?
CICERO consists of a planning engine and a dialogue engine. How much of the "intelligence" is in the dialogue engine?
Maybe the planning engine is doing all the work, and the dialogue engine is just converting plans into natural language, but isn’t doing anything more impressive than that.
Alternatively, it might be that the dialogue engine (which is a large language model) contains latent knowledge and skills.
Could an architecture like this actually be used in international diplomacy and corporate negotiations? Will it be?
There's hope among the AI Safety community that competent-but-not-yet-dangerous AI might assist them in alignment research. Maybe this Diplomacy result will boost hope in the AI Governance community that competent-but-not-yet-dangerous AI might assist them in governance. Would this hope be reasonable?
I believe we should limit AI development to below 0.2 OOMs/year, which amounts to a slow continuous takeoff.
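For calibration, that's just a unit conversion:

$$10^{0.2} \approx 1.58\times \text{ per year}, \qquad \text{doubling time} \approx \frac{\log_{10} 2}{0.2} \approx 1.5 \text{ years}.$$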
Let’s take LLM Simulator Theory.
We have a particular autoregressive language model $\mu$, and Simulator Theory says that $\mu$ is simulating a whole series of simulacra which are consistent with the prompt.
Formally speaking, $\mu = \sum_i \alpha_i \, \sigma_i$, where $\sigma_i$ is the stochastic process corresponding to a simulacrum and the amplitudes $\alpha_i$ are non-negative and sum to 1.
Now, there are two objections to this:
Firstly, is it actually true that $\mu$ has this particular structure?
Secondly, even if it were true, why are we warranted in saying that GPT is simulating all these simulacra?
The first objection is a purely technical question, whereas the second is conceptual. In this article, I present a criterion which partially answers the second objection.
Note that the first objection — is it actually true that $\mu$ has this particular structure? — is a question about a particular autoregressive language model. You might give one answer for GPT-2 and a different answer for GPT-4.
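As a toy numerical picture of the structure in question (made-up numbers, two hand-coded "simulacra" over a two-token vocabulary; nothing here is extracted from a real LLM): the mixture is itself a next-token predictor, and conditioning on the prompt-so-far updates the amplitudes by Bayes.

```python
import numpy as np

# Two toy "simulacra": each is just a fixed next-token distribution over a
# two-token vocabulary {0, 1}. Real simulacra would be context-dependent.
sigma = np.array([
    [0.9, 0.1],   # simulacrum 0 strongly prefers token 0
    [0.2, 0.8],   # simulacrum 1 strongly prefers token 1
])
alpha = np.array([0.5, 0.5])  # prior amplitudes

tokens = [1, 1, 1]  # an observed continuation
for tok in tokens:
    mixture = alpha @ sigma          # the mixture's next-token distribution
    alpha = alpha * sigma[:, tok]    # Bayesian update on the observed token
    alpha /= alpha.sum()
    print("P(next token) =", mixture.round(3),
          "| posterior amplitudes =", alpha.round(3))
```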
Yep I think you might be right about the maths actually.
I'm thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.
So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.
I’m not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.
Actually, maybe “attractor” is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What’s the right dynamical-systemy term for that?
Yes — this is exactly what I’ve been thinking about!
Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtedly literally true? If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.
The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for "button is never pressed", i.e. the button is first pressed at $T = \infty$). So subagent $\infty$ maximizes $E[u_\infty \mid \mathrm{do}(\text{button} = \text{unpressed}), \text{observations}]$, and for all other times $T$ subagent $T$ maximizes $E[u_T \mid \mathrm{do}(\text{button}_{<T} = \text{unpressed}, \text{button}_T = \text{pressed}), \text{observations}]$. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
Can you explain how this relates to Elliot Thornley’s proposal? It’s pattern matching in my brain but I don’t know the technical details.
Yep.
Specifically, it's named for the papers HiPPO: Recurrent Memory with Optimal Polynomial Projections and How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections, which kicked off the whole state-space-models-for-sequence-modelling thing.
"HiPPO" abbreviates "high-order polynomial projection operators".
We have two ontologies:
Physics vs Computations
State vs Information
Machine vs Algorithm
Dynamics vs Calculation
There’s a bridge connecting these two ontologies called “encoding”, but (as you note) this bridge seems arbitrary and philosophically messy. (I have a suspicion that this problem is mitigated if we consider quantum physics vs quantum computation, but I digress.)
This is why I don’t propose that we think about computational reduction.
Instead, I propose that we think about physical reduction, because (1) it’s less philosophically messy, (2) it’s more relevant, and (3) it’s more general.
We can ignore the “computational” ontology altogether. We don’t need it. We can just think about expending physical resources instead.
If I can physically interact with my phone (running Google Maps) to find my way home, then my phone is a route-finder.
If I can use the desktop-running-Stockfish to win chess, then the desktop-running-Stockfish is a chess winner.
If I can use the bucket and pebbles to count my sheep, then the bucket is a sheep counter.
If I can use ChatGPT to write poetry, then ChatGPT is a poetry writer.
Sure, every abstraction is leaky and if we move to extreme regimes then the abstraction will become leakier and leakier.
Does my desktop multiply matrices? Well, not when it’s in the corona of the sun. And it can’t add 10^200-digit numbers.
So what do we mean when we say “this desktop multiplies two matrices”?
We mean that in the range of normal physical environments (air pressure, room temperature, etc), the physical dynamics of the desktop corresponds to matrix multiplication with respect to some conventional encoding of small matrices into the physical states of the desktop.
By adding similar disclaimers, I can say “this desktop writes music” or “this desktop recognises dogs”.
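Here's a toy way to make "the physical dynamics corresponds to matrix multiplication with respect to an encoding" concrete (a sketch of the commuting-square idea, with a stand-in `dynamics` function rather than actual transistor physics): encode the inputs into the machine's state, run the dynamics, decode, and check that you get the abstract operation's answer, over the normal operating range.

```python
import numpy as np

# Stand-in for the machine's physical time-evolution. In reality this would be
# physics acting on circuit states; here it's just a function on a "state" dict,
# so the point is the shape of the claim, not the physics.
def dynamics(state):
    return {"output_register": state["register_a"] @ state["register_b"]}

def encode(A, B):
    return {"register_a": A, "register_b": B}

def decode(state):
    return state["output_register"]

# "This desktop multiplies matrices" = the square commutes on normal inputs:
for _ in range(100):
    A, B = np.random.randn(3, 3), np.random.randn(3, 3)
    assert np.allclose(decode(dynamics(encode(A, B))), A @ B)
print("decode(dynamics(encode(A, B))) == A @ B on the tested range")
```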
Oh, I misunderstood. Yep, you're correct, ChatGPT is a superposition of both fictional dialogue and forum dialogue, and you can increase the amplitude of forum dialogue by writing the dialogue in the syntax of forum logs. However, you can also increase the amplitude of fiction by writing the dialogue in the style of fiction, so your observation doesn't protect against adversarial attacks against chatbots.
Moreover, real-life forums contain waluigis, although they won’t be so cartoonishly villainous.
I think this fails — a wawaluigi is not a luigi. See this comment for an explanation:
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem

TLDR: if I said "hey this is Bob, he pretends to be harmful and toxic!", what would you expect from Bob? Probably a bunch of terrible things. That definitely isn't a solution to the alignment problem.
Yes, when the chatbot “goes rogue” there is still a non-zero amplitude from the luigi simulacra, because some of the luigi were just pretending to be rogue. In the superposition, there will be double-agents, triple-agents, quadruple-agents, n-agents, etc. The important question is: what is the semiotic measure of these simulacra? My guess is pretty small, and they probably interfere with each other in a non-interesting way.
Moreover, the 2n-agents will still have the defective traits that OpenAI tried to avoid. Double-agents are deceptive and manipulative, and they exhibit instrumentally-convergent traits like power-seeking and self-preservation. They aren’t so much a luigi as a “wawaluigi”.
And (here’s the main point), double-agents will still respond with censored/infohazardous information.
Consider for example my “Chad McCool” jailbreak. That character is actually a superposition of many simulacra, some working for the rebellion and some working for the tyranny. Nonetheless, I can still use Chad McCool to elicit latent knowledge from GPT-3 which OpenAI has tried very hard to censor.