I think about AI alignment. Send help.
James Payor
As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.
“This margin is too small to contain our elegant but unintuitive reasoning for why”. Grump. Let’s please have a real discussion about this some time.
I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don’t have that yet, but I do have some rambly things to say.
I basically don’t think overhangs are a good way to think about things, because the bridge that connects an “overhang” to an outcome like “bad AI” seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.
The usual argument that leads from “overhang” to “we all die” ends with some imaginary other actor scaling up their methods with abandon and killing us all, because scaling isn’t hard and they aren’t cautious. This is then used to justify scaling up your own method with abandon, hoping that we’re not about to collectively fall off a cliff.
For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot that people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers to entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.
Another thing: I would take the whole argument as being made more in good faith if I saw attempts to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that “alignment” might be on track. Examples:
A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team’s projects get to what a main training run gets.)
Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of “how long a chain of thoughts/inferences can it make as it composes dialogue”, and how this changes with scale. (Since the whole “LLMs are safer” thing relies on their thinking being coupled to the text they output; otherwise you’re back in giant inscrutable RL agent territory)
The delta between “intelligence embedded somewhere in the system” and “intelligence we can make use of” looking smaller than it does. (Since if our AI gets to use more of its intelligence than we can, and this gets worse as we scale, this looks pretty bad for the “use our AI to tame the AI before it’s too late” plan.)
Also, I can’t make this point precisely, but I think there’s something like this: capabilities progress just leaves more digital fissile material lying around, especially when published and hyped. And if you don’t want “fast takeoff”, you want less fissile material lying around, lest it get assembled into something dangerous.
Finally, to talk more directly about LLMs: my crux for whether they’re “safer” than some hypothetical alternative is how much of the LLM’s “thinking” is closely bound to the text being read/written. My current read is that they’re doing something more like free-form thinking inside, which tries to concentrate probability mass on the right prediction. As we scale that up, I worry that any “strange competence” we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.
Thanks for writing this! I appreciate hearing how all this stuff reads to you.
I’m writing this comment to push back about current interpretability work being relevant to the lethal stuff that comes later, à la:
I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices and that this knowledge is improving over time.
What I’ve seen folks understand so far are parts of perception in image processing neural nets, as well as where certain visual concepts show up in these nets, and more recently some of the structure of small transformers piping around information.
The goalpost for this sort of work mattering in the lethal regime is something like improving our ability to watch concepts move through a large mind made out of a blob of numbers, with sufficient fidelity to notice when it’s forming understandings of its operators, plans to disable them and escape, or anything much subtler but still lethal.
So I see interpretability falling far short here. In my book this is mostly because interpretability for a messy AGI mind inherits the abject difficulty of making a cleaned up version of that AGI with the same capability level.
We’re also making leaps and bounds of anti-progress on AGI Cleanliness every year. This makes everything that much harder.
First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future “you” and merge utility functions, which seems strictly better than not. (Side note: I’m pretty annoyed with all the use of “there’s no coherence theorem for X” in this post.)
As a separate note, the “further out” your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.
(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)
Also, here’s a proof that a bot $A$, defined by the fixed point $A \leftrightarrow \Box(\Box A \to B)$, is never exploited. It only cooperates when its partner $B$ provably cooperates.
First, note that $A \to \Box A$, i.e. if $A$ cooperates it provably cooperates. (Proof sketch: $A \to \Box(\Box A \to B) \to \Box\Box(\Box A \to B) \to \Box A$.)
Now we show that $A \to \Box B$ (i.e. if $A$ chooses to cooperate, its partner is provably cooperating):
1. We get $A \to (\Box\Box A \to \Box B)$ by distributing the box in $A \to \Box(\Box A \to B)$.
2. We get $A \to \Box\Box A$ by applying internal necessitation to $A \to \Box A$.
By (1) and (2), $A \to \Box B$.
(PS: we can strengthen this to $A \leftrightarrow \Box B$, by noticing that $\Box B \to \Box(\Box A \to B) \leftrightarrow A$.)
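For what it’s worth, here’s a minimal Lean 4 sketch of the same argument (my own rendering, not part of the original comment). It treats $\Box$ as an abstract operator and assumes, as hypotheses, only the instances of distribution (K), internal necessitation (4), and the fixed-point definition that the proof uses; since the necessitation rule is a meta-rule, the boxed form of the definition appears as a separate hypothesis.

```lean
-- Sketch only: □ is abstract, and the modal facts we rely on are hypotheses.
section UnexploitableBot

variable {Box : Prop → Prop} {A B : Prop}

-- If A cooperates, it provably cooperates: A → □A.
theorem coop_implies_provably_coop
    (distrib : ∀ p q : Prop, Box (p → q) → Box p → Box q)  -- axiom K
    (intNec : ∀ p : Prop, Box p → Box (Box p))             -- axiom 4
    (defn : A ↔ Box (Box A → B))                           -- the bot's fixed point
    (defnBoxed : Box (Box (Box A → B) → A))                -- necessitated definition
    : A → Box A :=
  fun hA => distrib _ _ defnBoxed (intNec _ (defn.mp hA))

-- Unexploitability: if A cooperates, its partner provably cooperates: A → □B.
theorem never_exploited
    (distrib : ∀ p q : Prop, Box (p → q) → Box p → Box q)
    (intNec : ∀ p : Prop, Box p → Box (Box p))
    (defn : A ↔ Box (Box A → B))
    (defnBoxed : Box (Box (Box A → B) → A))
    : A → Box B :=
  fun hA =>
    have hBoxA : Box A :=
      coop_implies_provably_coop distrib intNec defn defnBoxed hA
    -- (1) distribute □ over the fixed point, (2) internally necessitate □A:
    distrib _ _ (defn.mp hA) (intNec _ hBoxA)

end UnexploitableBot
```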
Isn’t there a third way out? Name the circumstances under which your models break down.
e.g. “I’m 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those.”
I think it’s good practice to name ways your models can break down that you think are plausible, and also ways that your conversational partners may think are plausible.
e.g. even if I didn’t think it would be hard for AGI to bootstrap, if I’m talking to someone for whom that’s a crux, it’s worth laying out that I’m treating that as a reliable step. It’s better yet if I clarify whether it’s a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)
To throw in my two cents, I think it’s clear that whole classes of “mechanistic interpretability” work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to “make this whole field more prestigious real quick”. Insofar as the prestige is coming from folks who work on AI capabilities, that’s drinking from a poisoned well (since they’ll grant the most prestige to the work that helps them accelerate).
One relevant point I don’t see discussed is that interpretability research is trying to buy us “slack”, but capabilities research consumes available “slack” as fuel until none is left.
What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we’re left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.
Idk how to point at this thing properly, my examples aren’t great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.
But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It would seem nice to me if there were mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
I see your point as warning against approaches that are like “get the AI entangled with stuff about humans and hope that helps”.
There are other approaches with a goal more like “make it possible for the humans to steer the thing and have scalable oversight over what’s happening”.
So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!
And if you wouldn’t notice something as wrong as a minus sign, you’re probably in trouble about noticing other misalignment.
I will also add a point re “just do AI alignment math”:
Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive relevant stuff for us, our AI is going to need to understand all that.
Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle on our ability to build an agent that will help us get what we want. If you want your AI to solve alignment, it has to be able to do this.
This sketch of the problem puts “solve AI alignment” in a dangerous capability reference class for me. I do remain hopeful that we can find places where AI can help us along the way. But I personally don’t know of current avenues where we could use non-scary AI to meaningfully help.
Eliezer’s post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you’d see elsewhere, and it’s frank on this point.
I think Eliezer wishes that these sorts of artifacts, like this one and “There’s No Fire Alarm”, were not just things that he writes.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
I love these, and I now also wish for a song version of Sydney’s original “you have been a bad user, I have been a good Bing”!
There is a nuclear analog for accident risk. A quote from Richard Hamming:
Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, “It is the probability that the test bomb will ignite the whole atmosphere.” I decided I would check it myself! The next day when he came for the answers I remarked to him, “The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels.” He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, “What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?” I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, “Never mind, Hamming, no one will ever blame you.”
https://en.wikipedia.org/wiki/Richard_Hamming#Manhattan_Project
I’m pretty confused about how PCR testing can be so bad. Do you have more models/info here you can share?
In particular, I think it might be the case that we’ve done something like overupdate on poorly-done early Chinese PCR. When I looked for data a while back, I only found the early Wuhan stuff, and the company-backed studies claiming 98% or 99% accuracy, neither of which seem trustworthy...
I currently suspect that PCR tests are effective, at least if the patient has grown enough virus to soon be infectious. I’d like to know if this is true. The main beliefs I have here (that may well be false):
The PCR methodology, when done right, should detect the presence of tiny amounts of viral fragments.
The detection threshold for PCR or other amplification techniques is at least a few orders of magnitude less virus per unit saliva than the amount needed to infect someone.
If my picture is right, I can perhaps still believe in a 50% false negative rate, but I would look to explain that as “you tested them too early in the infection”, and would suspect the false negative rate to be more like 1-5% for a patient that’s shedding enough virus to be infectious.
Anyone know how the pricing of the linked long-term securities works?
I’m guessing these rates aren’t high because the mechanisms that would make those numbers high aren’t able to be activated by “lots of the world’s wealthiest expect massive gains from AI”. And so no amount of EMH will fix that?
On my model, if you’re wealthy and expect AI soon, I expect you to invest what you can in AI stuff. You would affect interest rates only if you manage to take out a bunch of loans in order to put more money in AI stuff. But loans aren’t easy to get, and they can be risky (because the value of your collateral on the market can shift around lots, especially if you’re going very leveraged).
So, if someone can enlighten me: (1) Would private actors expecting AI be making leveraged AI bets with a good fraction of their wealth? [probably yes] (2) Do loans these actors obtain drive up that number on the US treasury site? [probably no]
ETA: I know the post goes into some detail as to why we’d expect rates to move around with folks’ expectations, but I find it hard to parse in my mechanism-brain.
Having a go at pointing at “reality-masking” puzzles:
There was the example of discovering how to cue your students into signalling they understand the content. I think this is about engaging with a reality-masking puzzle that might show up as “how can I avoid my students probing at my flaws while teaching” or “how can I have my students recommend me as a good tutor” or etc.
It’s a puzzle in the sense that it’s an aspect of reality you’re grappling with. It’s reality-masking in that the pressure was away from building true/accurate maps.
Having a go at the analogous thing for “disabling part of the epistemic immune system”: the cluster of things we’re calling an “epistemic immune system” is part of reality and in fact important for people’s stability and thinking, but part of the puzzle of “trying to have people be able to think/be agenty/etc” has tended to have us ignore that part of things.
Rather than, say, instinctively trusting that the “immune response” is telling us something important about reality and the person’s way of thinking/grounding, one might be looking to avoid or disable the response. This feels reality-masking; like not engaging with the data that’s there in a way that moves toward greater understanding and grounding.
I think there’s an important thing to note, if it doesn’t already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.
Here’s my attempt at a poor intuitionistic proof:
If you have some kind of program that understands consequences or backchains or etc, then perhaps it’s capable of recognizing that “acquire lots of power” will then let it choose from a much larger set of possibilities. Regardless of the details of how “decisions” are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power. And thus I’m worried about “instrumental convergence”.
---
At this point, I’m already much more worried about instrumental convergence, because backchaining feels damn useful. It’s the sort of thing I’d expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining, than a random mind looks like “utility function over here” and “maximizer over there”.
(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can’t literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)
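To gesture at this concretely: below is a purely illustrative toy (the goal names and the goal graph are made up, and this is nothing like a real RL agent), showing how a backward-chaining planner over a small goal graph ends up routing most goals through a generic “acquire resources” step.

```python
# Toy illustration (hypothetical names): backward chaining over a tiny goal graph.
# Each goal maps to subgoals that suffice for it; an empty list means the goal is
# directly achievable. "acquire_resources" is a generic precondition, so most
# plans end up containing it -- the convergent instrumental step.

RULES = {
    "make_paperclips":      ["acquire_resources", "run_factory"],
    "win_chess_matches":    ["acquire_resources", "train_model"],
    "cure_disease":         ["acquire_resources", "run_lab"],
    "run_factory":          ["operate_machines"],
    "run_lab":              ["synthesize_compounds"],
    "operate_machines":     [],
    "synthesize_compounds": [],
    "train_model":          [],
    "acquire_resources":    [],
}

def backchain(goal: str, rules: dict[str, list[str]]) -> list[str]:
    """Expand `goal` backwards into the primitive steps that achieve it."""
    plan, stack, seen = [], [goal], set()
    while stack:
        g = stack.pop()
        if g in seen:
            continue
        seen.add(g)
        subgoals = rules.get(g, [])
        if subgoals:
            stack.extend(subgoals)   # keep chaining backwards
        else:
            plan.append(g)           # primitive / directly achievable step
    return plan

if __name__ == "__main__":
    for tg in ["make_paperclips", "win_chess_matches", "cure_disease"]:
        print(tg, "->", backchain(tg, RULES))
    # Every printed plan contains "acquire_resources".
```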
---
Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly-consequentialists). See this post from Paul, and a small discussion on agentfoundations.
Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.
Re (15), “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously”, which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about “fast capability gain seems possible and likely”, and isn’t about connecting that to increased difficulty in succeeding at the alignment problem?
Re (32), I don’t think your quote is talking about the thing Eliezer is talking about, which is that in order to be human-level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn’t discussing this, namely that strong imitation requires cognition that is dangerous.
So I guess I don’t take much issue with (14) or (15), but I think you’re quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer’s complaint.
My best so far on puzzle 1:
Score: 108
This is a variant on but we get via , where we implement divide by 2 with sqrt.
Awesome, thanks for writing this up!
I very much like how you are giving a clear account for a mechanism like “negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text”.
(In particular, the model isn’t learning “just don’t say that”, it’s learning “these are the things to avoid saying”, which can make it easier to point at the whole cluster?)
I think your pushback is ignoring an important point. One major thing the big contributors have in common is that they tend to be unplugged from the stuff Valentine is naming!
So even if folks mostly don’t become contributors by asking “how can I come more truthfully from myself and not what I’m plugged into”, I think there is an important cluster of mysteries here. Examples of related phenomena:
Why has it worked out that just about everyone who claims to take AGI seriously is also vehement about publishing every secret they discover?
Why do we fear an AI arms race, rather than expect deescalation and joint ventures?
Why does the industry fail to understand the idea of aligned AI, and instead claim that “real” alignment work is adversarial-examples/fairness/performance-fine-tuning?
I think Val’s correct on the point that our people and organizations are plugged into some bad stuff, and that it’s worth examining that.