I spend time with my budding family, and think about AI alignment.
Some other links at payor.io
With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to “RLHF” out?):
I think we have no reason to believe that these post-training methods (be it finetuning, RLHF, RLAIF, etc) modify “deep cognition” present in the network, rather than updating shallower things like “higher prior on this text being friendly” or whatnot.
I think the important points are:
These techniques supervise only the text output. There is no direct contact with the thought process leading to that output.
They make incremental local tweaks to the weights that move in the direction of the desired text.
Gradient descent prefers to find the smallest changes to the weights that yield the result.
Evidence in favor of this is the difficulty of eliminating “jailbreaking” with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.
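To make the third point above a little more concrete, here is a minimal sketch in a toy setting. It assumes a single linear output trained with plain gradient descent on squared error, which is a big simplification of real post-training; the weights, features, and target are all hypothetical. In this toy case gradient descent lands on the solution nearest to the starting weights, and leaves untouched any weight that does not affect the supervised output.

```python
# Toy illustration only: a single linear "output" x.w supervised toward y.
# All numbers are hypothetical; this is not a model of real LLM training.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_descent(w0, x, y, lr=0.05, steps=2000):
    """Minimise 0.5 * (x.w - y)^2 by plain gradient descent, starting from w0."""
    w = list(w0)
    for _ in range(steps):
        err = dot(x, w) - y
        # The gradient of the loss with respect to w is err * x.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def nearest_solution(w0, x, y):
    """Closed form: the point with x.w == y that is closest to w0."""
    scale = (dot(x, w0) - y) / dot(x, x)
    return [wi - scale * xi for wi, xi in zip(w0, x)]

w0 = [3.0, 1.0, -2.0]  # hypothetical "pretrained" weights
x = [1.0, 2.0, 0.0]    # hypothetical input; the third weight is not involved
y = 0.0                # desired output

w_gd = gradient_descent(w0, x, y)
w_min = nearest_solution(w0, x, y)
print(w_gd, w_min)
```

Many other weight vectors also produce the desired output, but the updates only ever move along the direction of `x`, so everything orthogonal to the supervised signal stays where it started.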
Spinoza suggested that we first passively accept a proposition in the course of comprehending it, and only afterward actively disbelieve propositions which are rejected by consideration.
Some distinctions that might be relevant:
1. Parsing a proposition into your ontology, understanding its domains of applicability, implications, etc.
2. Having a sense of what it might be like for another person to believe the proposition, what things it implies about how they’re thinking, etc.
3. Thinking the proposition is true, believing its implications in the various domains its assumptions hold, etc.
If you ask me for what in my experience corresponds to a feeling of “passively accepting a proposition” when someone tells me, I think I’m doing a bunch of (1) and (2). This does feel like “accepting” or “taking in” the proposition, and can change how I see things if it works.
Awesome, thanks for writing this up!
I very much like how you are giving a clear account for a mechanism like “negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text”.
(In particular, the model isn’t learning “just don’t say that”, it’s learning “these are the things to avoid saying”, which can make it easier to point at the whole cluster?)
I tried to formalize this, using $A \to B$ as a “poor man’s counterfactual”, standing in for “if Alice cooperates then so does Bob”. This has the odd behaviour of becoming “true” when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.
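The odd behaviour of the “poor man’s counterfactual” is just the truth table of material implication. A minimal sketch, where `implies` is a hypothetical helper and the booleans stand for “Alice cooperates” and “Bob cooperates”:

```python
def implies(alice_cooperates: bool, bob_cooperates: bool) -> bool:
    """Material implication: false only when the premise holds and the conclusion fails."""
    return (not alice_cooperates) or bob_cooperates

# Enumerate all four cases of (Alice cooperates, Bob cooperates).
truth_table = {
    (a, b): implies(a, b)
    for a in (True, False)
    for b in (True, False)
}

for (a, b), value in truth_table.items():
    print(f"Alice cooperates={a!s:5}  Bob cooperates={b!s:5}  implication holds={value}")
```

Whenever Alice defects, the implication holds regardless of what Bob does, which is why it is a poor stand-in for a real counterfactual.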
For technical reasons we upgrade to $\Box A \to B$, which says “if Alice cooperates in a legible way, then Bob cooperates back”. Alice tries to prove this, and legibly cooperates if so.
This setup gives us “Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back”. In symbols, $\Box(\Box A \to B) \to A$.
Now, is this okay? What about proving $\Box A \to \bot$?
Well, actually you can’t ever prove that! Because of Löb’s theorem.
Outside the system we can definitely see cases where $A$ is unprovable, e.g. because Bob always defects. But you can’t prove this inside the system. You can only prove things like “$\Box_k A \to \bot$” for finite proof lengths $k$.
I think this is best seen as a consequence of “with finite proof strength you can only deny proofs up to a limited size”.
So this construction works out, but perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, “cooperate if choosing to do so leads to a good outcome”, is pretty trustworthy. It perhaps deserves to be cashed out in better provability math.
(Thanks also to you for engaging!)
Hm. I’m going to take a step back, away from the math, and see if that makes things less confusing.
Let’s go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).
When Alice goes to think about what Bob will do, maybe she sees that Bob’s decision depends on what he thinks Alice will do.
At this junction, I don’t want Alice to “recurse”, falling down the rabbit hole of “Alice thinking about Bob thinking about Alice thinking about—” and etc.
Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.
This manoeuvre is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying “what Bob thinks about Alice” as a node under Alice’s control. This is the heart of it, and imo doesn’t rely on anything weird or unusual.
For the setup $\Box(\Box E \to E) \to E$ (writing $E$ for “everyone cooperates”), it’s a bit more like: each member cooperates if they can prove that a compelling argument for “everyone cooperates” is sufficient to ensure “everyone cooperates”.
Your second line seems right though! If there were provably no argument for straight up “everyone cooperates”, i.e. $\Box(\Box E \to \bot)$, this implies $\Box(\Box E \to E)$ and therefore $E$, a contradiction.
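For reference, here is a sketch of why a setup of this shape yields cooperation at all. It assumes $E$ stands for “everyone cooperates”, that the group's strategy gives $\vdash \Box(\Box E \to E) \to E$, and that $\Box$ obeys the usual necessitation and distribution rules of provability logic; the step labels are added for illustration.

```latex
\begin{align*}
&1.\ \vdash E \to (\Box E \to E) && \text{tautology} \\
&2.\ \vdash \Box E \to \Box(\Box E \to E) && \text{necessitation and distribution applied to 1} \\
&3.\ \vdash \Box(\Box E \to E) \to E && \text{the assumed setup} \\
&4.\ \vdash \Box E \to E && \text{chaining 2 and 3} \\
&5.\ \vdash \Box(\Box E \to E) && \text{necessitation applied to 4} \\
&6.\ \vdash E && \text{modus ponens on 3 and 5}
\end{align*}
```

Note that no step invokes Löb's theorem directly; only necessitation, distribution, and the assumed setup are used.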
--
Also I think I’m a bit less confused here these days, and in case it helps:
Don’t forget that “$\Box P$” means “a proof of any size of $P$”, which is kinda crazy, and can be responsible for things not lining up with your intuition. My hot take is that Löb’s theorem / incompleteness says “with finite proof strength you can only deny proofs up to a limited size, on pain of diagonalization”. Which is way saner than the usual interpretation!
So idk, especially in this context I think it’s a bad idea to throw out your intuition when the math seems to say something else, since the mismatch probably comes down to some subtlety in this formalization of provability/meta-mathematics. And I presently think the quirks of provability logic are often bugs caused by bad choices in the formalism.
Yeah I think my complaint is that OpenAI seems to be asserting almost a “boundary” re goal (B), like there’s nothing that trades off against staying at the front of the race, and they’re willing to pay large costs rather than risk being the second-most-impressive AI lab. Why? Things don’t add up.
(Example large cost: they’re not putting much organizational attention into the alignment problem. The alignment team projects don’t have many people working on them, and they’re not doing things like inviting careful thinkers to evaluate their plans under secrecy, or taking any of a bunch of other obvious actions that come from putting serious resources into not blowing everyone up.)
I don’t buy that (B) is that important. It seems more driven by some strange status / narrative-power thing? And I haven’t ever seen them make an explicit case for why they’re sacrificing so much for (B). Especially when a lot of their original safety people fucking left due to some conflict around this?
Broadly many things about their behaviour strike me as deceptive / making it hard to form a counternarrative / trying to conceal something odd about their plans.
One final question: why do they say “we think it would be good if an international agency limited compute growth” but not also “and we will obviously be trying to partner with other labs to do this ourselves in the meantime, although not if another lab is already training something more powerful than GPT-4”?
I kinda reject the energy of the hypothetical? But I can speak to some things I wish I saw OpenAI doing:
Having some internal sense amongst employees about whether they’re doing something “good” given the stakes, like Google’s old “don’t be evil” thing. Have a culture of thinking carefully about things and managers taking considerations seriously, rather than something more like management trying to extract as much engineering as quickly as possible without “drama” getting in the way.
(Perhaps they already have a culture like this! I haven’t worked there. But my prediction is that they do not, and that the org has a more “extractive” relationship to its employees. I think that this is bad, causes working toward danger, and exacerbates bad outcomes.)
To the extent that they’re trying to have the best AGI tech in order to provide “leadership” of humanity and AI, I want to see them be less shady / marketing / spreading confusion about the stakes.
They worked to pervert the term “alignment” to be about whether you can extract more value from their LLMs, and distract from the idea that we might make digital minds that are copyable and improvable, while also large and hard to control. (While pushing directly on AGI designs that have the “large and hard to control” property, which I guess they’re denying is a mistake, but anyhow.)
I would like to see fewer things perverted/distracted/confused. It’s according-to-me entirely possible for them to state more clearly what the end of all this is, and be more explicit about how they’re trying to lead the effort.
Reconcile with Anthropic. There is no reason, speaking on humanity’s behalf, to risk two different trajectories of giant LLMs built with subtly different technology, while dividing up the safety know-how amidst both organizations.
Furthermore, I think OpenAI kind-of stole/appropriated the scaling idea from the Anthropic founders, who left when they lost a political battle about the direction of the org. I suspect it was a huge fuck-you when OpenAI tried to spread this secret to the world, and continued to grow their org around it, while ousting the originators. If my model is at-all-accurate, I don’t like it, and OpenAI should look to regain “good standing” by acknowledging this (perhaps just privately), and looking to cooperate.
Idk, maybe it’s now legally impossible/untenable for the orgs to work together, given the investors or something? Or given mutual assumption of bad-faith? But in any case this seems really shitty.
I also mentioned some other things in this comment.
I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don’t have that yet, but I do have some rambly things to say.
I basically don’t think overhangs are a good way to think about things, because the bridge that connects an “overhang” to an outcome like “bad AI” seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.
The usual argument that leads from “overhang” to “we all die” has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it’s not hard to scale and they aren’t cautious. This is then used to justify scaling up your own method with abandon, hoping that we’re not about to collectively fall off a cliff.
For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers-to-entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.
Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that “alignment” might be on track. Examples:
A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team’s projects get to what a main training run gets.)
Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of “how long a chain of thoughts/inferences can it make as it composes dialogue”, and how this changes with scale. (Since the whole “LLMs are safer” thing relies on their thinking being coupled to the text they output; otherwise you’re back in giant inscrutable RL agent territory)
The delta between “intelligence embedded somewhere in the system” and “intelligence we can make use of” looking smaller than it does. (Since if our AI gets to use more of its intelligence than us, and this gets worse as we scale, this looks pretty bad for the “use our AI to tame the AI before it’s too late” plan.)
Also I can’t make this point precisely, but I think there’s something like capabilities progress just leaves more digital fissile material lying around the place, especially when published and hyped. And if you don’t want “fast takeoff”, you want less fissile material lying around, lest it get assembled into something dangerous.
Finally, to more directly talk about LLMs, my crux for whether they’re “safer” than some hypothetical alternative is about how much of the LLM “thinking” is closely bound to the text being read/written. My current read is that they’re more like doing free-form thinking inside, which tries to concentrate mass on the right prediction. As we scale that up, I worry that any “strange competence” we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.
As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.
“This margin is too small to contain our elegant but unintuitive reasoning for why”. Grump. Let’s please have a real discussion about this some time.
(Edit: others have made this point already, but anyhow)
My main objection to this angle: self-improvements do not necessarily look like “design a successor AI to be in charge”. They can look more like “acquire better world models”, “spin up more copies”, “build better processors”, “train lots of narrow AI to act as fingers”, etc.
I don’t expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.
Is the following an accurate summary?
The agent is built to have a “utility function” input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according to a combination of the utility functions it anticipates across time steps?
If that’s correct, here are some places this conflicts with my intuition about how things should be done:
I feel awkward about the randomness being treated as essential. I’d rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations… (Not that I have an alternative that springs to mind!)
I also feel like “worst case” is perhaps problematic, since it’s bringing maximization in, and you’re then needing to rely on your convex set being some kind of smooth in order to get good outcomes. If I have a distribution over potential utility functions, and quantilize for the worst 10% of possibilities, does that do the same sort of work that “worst case” is doing for mild optimization?
Can I check that I follow how you recover quantilization?
Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution?
If so, proposing a particular action is evaluated badly? (Since there’s a utility function in your set that spikes downward at that action.)
But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of “in-distribution” behaviour?
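Here is a toy sketch of the picture I’m asking about, to check it hangs together. It assumes a very simple stand-in for the utility set: a known base utility, plus one adversarially chosen action that gets spiked downward by a fixed `penalty`. The action utilities, the penalty, and the helper names are all hypothetical, and this is not the actual infrabayesian construction.

```python
def worst_case_expected_utility(probs, base_utility, penalty):
    """Worst case over the toy set {u - penalty * [a == a_star]} of utility functions.

    The adversary picks which single action a_star to spike downward, so the
    worst case targets the action carrying the most probability mass.
    """
    expected = sum(p * u for p, u in zip(probs, base_utility))
    return expected - penalty * max(probs)

def point_mass(n, index):
    """Distribution that always proposes one particular action."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def uniform_over_top(base_utility, k):
    """Quantilizer-like distribution: uniform over the k best actions."""
    ranked = sorted(range(len(base_utility)), key=lambda i: base_utility[i], reverse=True)
    top = set(ranked[:k])
    return [1.0 / k if i in top else 0.0 for i in range(len(base_utility))]

base_utility = [i / 10 for i in range(10)]  # hypothetical "in-distribution" utilities
penalty = 1.0                               # hypothetical size of a downward spike

best_action = point_mass(len(base_utility), 9)
top_half = uniform_over_top(base_utility, 5)

score_single = worst_case_expected_utility(best_action, base_utility, penalty)
score_spread = worst_case_expected_utility(top_half, base_utility, penalty)
print(score_single, score_spread)
```

In this toy version, proposing the single best action is evaluated badly because the spike lands right on it, while randomizing over the top few actions dilutes any one spike and scores better in the worst case.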
Edited to add: fwiw it seems awesome to see quantilization formalized as popping out of an adversarial robustness setup! I haven’t seen something like this before, and didn’t notice if the infrabayes tools were building to these kinds of results. I’m very much wanting to understand why this works in my own native-ontology-pieces.
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and there’s a lot more attention/people/incentives for capabilities.
I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.
Hm, I should also ask if you’ve seen the results of current work and think it’s evidence that we get more understandable models, more so than we get more capable models?
I think the issue is that when you get more understandable base components, and someone builds an AGI out of those, you still don’t understand the AGI.
That research is surely helpful though if it’s being used to make better-understood things, rather than enabling folk to make worse-understood more-powerful things.
I think moving in the direction of “insights are shared with groups the researcher trusts” should broadly help with this.
I’m perhaps misusing “publish” here, to refer to “putting stuff on the internet” and “raising awareness of the work through company Twitter” and etc.
I mostly meant to say that, as I see it, too many things that shouldn’t be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, so does a bunch of others.
Also, I’m grateful to know your read! I’m broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
I mostly do just mean “keeping it within a single research group” in the absence of better ideas. And I don’t have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some “I won’t use this for capabilities work without the permission of the authors” legal docs as well.
This isn’t something I can visualize working, but maybe it has components of an answer.
See also: LLMs Sometimes Generate Purely Negatively-Reinforced Text