I think about AI alignment. Send help.
James Payor
Thinking about maximization and corrigibility
As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.
“This margin is too small to contain our elegant but unintuitive reasoning for why”. Grump. Let’s please have a real discussion about this some time.
Some constructions for proof-based cooperation without Löb
I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don’t have that yet, but I do have some rambly things to say.
I basically don’t think overhangs are a good way to think about things, because the bridge that connects an “overhang” to an outcome like “bad AI” seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.
The usual argument that leads from “overhang” to “we all die” has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it’s not hard to scale and they aren’t cautious. This is then used to justify scaling up your own method with abandon, hoping that we’re not about to collectively fall off a cliff.
For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers to entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.
Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that “alignment” might be on track. Examples:
A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team’s projects get to what a main training run gets.)
Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of “how long a chain of thoughts/inferences can it make as it composes dialogue”, and how this changes with scale. (Since the whole “LLMs are safer” thing relies on their thinking being coupled to the text they output; otherwise you’re back in giant inscrutable RL agent territory)
The delta between “intelligence embedded somewhere in the system” and “intelligence we can make use of” looking smaller than it does. (Since if our AI gets to use more of its intelligence than we do, and this gets worse as we scale, this looks pretty bad for the “use our AI to tame the AI before it’s too late” plan.)
Also I can’t make this point precisely, but I think there’s something like capabilities progress just leaves more digital fissile material lying around the place, especially when published and hyped. And if you don’t want “fast takeoff”, you want less fissile material lying around, lest it get assembled into something dangerous.
Finally, to talk more directly about LLMs, my crux for whether they’re “safer” than some hypothetical alternative is about how much of the LLM “thinking” is closely bound to the text being read/written. My current read is that they’re more like doing free-form thinking inside, which tries to concentrate mass on the right prediction. As we scale that up, I worry that any “strange competence” we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.
Thanks for writing this! I appreciate hearing how all this stuff reads to you.
I’m writing this comment to push back on the idea that current interpretability work is relevant to the lethal stuff that comes later, à la:
I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices and that this knowledge is improving over time.
What I’ve seen folks understand so far are parts of perception in image processing neural nets, as well as where certain visual concepts show up in these nets, and more recently some of the structure of small transformers piping around information.
The goalpost for this sort of work mattering in the lethal regime is something like improving our ability to watch concepts move through a large mind made out of a blob of numbers, with sufficient fidelity to notice when it’s forming understandings of its operators, plans to disable them and escape, or anything much subtler but still lethal.
So I see interpretability falling far short here. In my book this is mostly because interpretability for a messy AGI mind inherits the abject difficulty of making a cleaned up version of that AGI with the same capability level.
We’re also making leaps and bounds of anti-progress on AGI Cleanliness every year. This makes everything that much harder.
First problem with this argument: there are no coherence theories saying that an agent needs to maintain the same utility function over time.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future “you” and merge utility functions, which seems strictly better than not. (Side note: I’m pretty annoyed with all the use of “there’s no coherence theorem for X” in this post.)
As a separate note, the “further out” your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.
(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)
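Also, here’s a proof that a bot $A \leftrightarrow \Box(\Box A \to B)$ is never exploited. It only cooperates when its partner $B$ provably cooperates.
First, note that $A \to \Box A$, i.e. if $A$ cooperates it provably cooperates. (Proof sketch: $A \leftrightarrow \Box X \to \Box\Box X \leftrightarrow \Box A$, where $X$ abbreviates $\Box A \to B$.)
Now we show that $A \to \Box B$ (i.e. if $A$ chooses to cooperate, its partner is provably cooperating):
1. We get $A \to (\Box\Box A \to \Box B)$ by distributing.
2. We get $A \to \Box\Box A$ by applying internal necessitation to $A \to \Box A$.
By (1) and (2), $A \to \Box B$.
(PS: we can strengthen this to $A \leftrightarrow \Box B$, by noticing that $\Box B \to \Box(\Box A \to B) \leftrightarrow A$.)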
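As a sanity check on the argument above (not a substitute for the proof), here is a minimal brute-force sketch in Python. It assumes the bot is defined by A ↔ □(□A → B), with A read as “this bot cooperates” and B as “its partner cooperates”; that definition, the function names, and the choice to search only frames with at most three worlds are all assumptions made for illustration. It enumerates small transitive, irreflexive Kripke frames, keeps the valuations where the definition holds at every world, and looks for a world where A and □B disagree.

```python
from itertools import product

def box(rel, truth):
    """□φ holds at world w iff φ holds at every world reachable from w."""
    n = len(truth)
    return tuple(all(truth[v] for v in range(n) if rel[w][v]) for w in range(n))

def implies(p, q):
    """Pointwise material implication between two per-world truth tuples."""
    return tuple((not a) or b for a, b in zip(p, q))

def transitive_irreflexive_frames(n):
    """Yield every transitive, irreflexive accessibility relation on n worlds."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product([False, True], repeat=len(pairs)):
        rel = [[False] * n for _ in range(n)]
        for (i, j), bit in zip(pairs, bits):
            rel[i][j] = bit
        # Transitivity: i->j and j->k must imply i->k (this also rules out cycles,
        # because the diagonal is fixed to False).
        if all(rel[i][k] or not (rel[i][j] and rel[j][k])
               for i in range(n) for j in range(n) for k in range(n)):
            yield rel

def find_counterexample(max_worlds=3):
    """Search for a model of A <-> □(□A -> B) in which A <-> □B fails somewhere."""
    for n in range(1, max_worlds + 1):
        for rel in transitive_irreflexive_frames(n):
            for A in product([False, True], repeat=n):
                for B in product([False, True], repeat=n):
                    # Assumed bot definition, required to hold at every world.
                    if A != box(rel, implies(box(rel, A), B)):
                        continue
                    # Claim under test: A holds exactly where □B holds.
                    if A != box(rel, B):
                        return n, rel, A, B
    return None

print(find_counterexample())
```

Finding no counterexample on these small frames is only weak evidence, but it is a cheap way to catch a mis-stated step before trusting the derivation.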
Isn’t there a third way out? Name the circumstances under which your models break down.
e.g. “I’m 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those.”
I think it’s good practice to name ways your models can break down that you think are plausible, and also ways that your conversational partners may think are plausible.
e.g. even if I didn’t think it would be hard for AGI to bootstrap, if I’m talking to someone for whom that’s a crux, it’s worth laying out that I’m treating that as a reliable step. It’s better yet if I clarify whether it’s a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)
To throw in my two cents, I think it’s clear that whole classes of “mechanistic interpretability” work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to “make this whole field more prestigious real quick”. Insofar as the prestige is coming from folks who work on AI capabilities, that’s drinking from a poisoned well (since they’ll grant the most prestige to the work that helps them accelerate).
One relevant point I don’t see discussed is that interpretability research is trying to buy us “slack”, but capabilities research consumes available “slack” as fuel until none is left.
What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we’re left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.
Idk how to point at this thing properly, my examples aren’t great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.
But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It would seem nice to me if there was mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
I see your point as warning against approaches that are like “get the AI entangled with stuff about humans and hope that helps”.
There are other approaches with a goal more like “make it possible for the humans to steer the thing and have scalable oversight over what’s happening”.
So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!
And if you wouldn’t notice something as wrong as a minus sign, you’re probably in trouble about noticing other misalignment.
I will also add a point re “just do AI alignment math”:
Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive relevant stuff for us, our AI is going to need to understand all that.
Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle on our ability to build an agent that will help us get what we want. If you want your AI to solve alignment, it has to be able to do this.
This sketch of the problem puts “solve AI alignment” in a dangerous capability reference class for me. I do remain hopeful that we can find places where AI can help us along the way. But I personally don’t know of current avenues where we could use non-scary AI to meaningfully help.
Eliezer’s post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you’d see elsewhere, and it’s frank on this point.
I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and “There is no fire alarm”.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
I love these, and I now also wish for a song version of Sydney’s original “you have been a bad user, I have been a good Bing”!
There is a nuclear analog for accident risk. A quote from Richard Hamming:
Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, “It is the probability that the test bomb will ignite the whole atmosphere.” I decided I would check it myself! The next day when he came for the answers I remarked to him, “The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels.” He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, “What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?” I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, “Never mind, Hamming, no one will ever blame you.”
https://en.wikipedia.org/wiki/Richard_Hamming#Manhattan_Project
I’m pretty confused about how PCR testing can be so bad. Do you have more models/info here you can share?
In particular, I think it might be the case that we’ve done something like overupdate on poorly-done early Chinese PCR. When I looked for data a while back, I only found the early Wuhan stuff, and the company-backed studies claiming 98% or 99% accuracy, neither of which seem trustworthy...
I currently suspect that PCR tests are effective, at least if the patient has grown enough virus to soon be infectious. I’d like to know if this is true. The main beliefs I have here (that may well be false):
The PCR methodology, when done right, should detect the presence of tiny amounts of viral fragments.
The amount of virus needed per unit saliva to infect someone is at least a few orders of magnitude less than the detection threshold for PCR or other amplification techniques.
If my picture is right, I can perhaps still believe in a 50% false negative rate, but I would look to explain that as “you tested them too early in the infection”, and would suspect the false negative rate to be more like 1-5% for a patient that’s shedding enough virus to be infectious.
Does anyone know how the pricing of the linked long-term securities works?
I’m guessing these rates aren’t high because the mechanisms that would make those numbers high aren’t able to be activated by “lots of the world’s wealthiest expect massive gains from AI”. And so no amount of EMH will fix that?
On my model, if you’re wealthy and expect AI soon, I expect you to invest what you can in AI stuff. You would affect interest rates only if you manage to take out a bunch of loans in order to put more money in AI stuff. But loans aren’t easy to get, and they can be risky (because the value of your collateral on the market can shift around lots, especially if you’re going very leveraged).
So, if someone can enlighten me: (1) Would private actors expecting AI be making leveraged AI bets with a good fraction of their wealth? [probably yes] (2) Do loans these actors obtain drive up that number on the US treasury site? [probably no]
ETA: I know the post goes into some detail as to why we’d expect rates to move around with folks’ expectations, but I find it hard to parse in my mechanism-brain.
Hello! I’m glad to read more material on this subject.
First I want to note that it took me some time to understand the setup, since you’re working with a different notion of maximal lottery-lotteries than the one Scott wrote about. This made it unclear to me what was going on until I’d read a good way through and put it together, and it changes the meaning of the post’s title as well.
For that reason I’d like to recommend adding something like “Geometric” in your title. Perhaps we can then talk about this construction as “Geometric Maximal Lottery-Lotteries”, or “Maximal Geometric Lottery-Lotteries”? Whichever seems better!
It seems especially important to distinguish the names because these seem to behave differently from the linear version. (They have different properties in how they treat the voters, and perhaps fewer or different difficulties in existence, stability, and effective computation.)
With that out of the way, I’m a tentative fan of the geometric version, though I have more to unpack about what it means. I’ll divide my thoughts & questions into a few sections below. I am likely confused on several points. And my apologies if my writing is unclear, please ask followup questions where interesting!
Underlying models of power for majoritarian vs geometric
When reading the earlier sequence I was struck by how unwieldy the linear/majoritarian formulation ends up being! Specifically, it seemed that the full maximal-lottery-lottery would need to encode all of the competing coordination cliques in the outer lottery, but then these are unstable to small perturbations that shift coordination options from below-majority to above-majority. This seemed like a real obstacle to effectively computing approximations, and if I understand correctly it is what causes the discontinuity that breaks the Nash-equilibria-based existence proof.
My thought then about what might make more sense was a model of “war”/“power” in which votes against directly cancel out votes for. So in the case of an even split we get zero utility rather than whatever the majority’s utility would be. My hope was that this was a more realistic model of how power should work, which would also be stable to small perturbations and lend more weight to outcomes preferred by supermajorities. I never cashed this out fully though, since I didn’t find an elegant justification and lost interest.
So I haven’t thought this part through much (yet), but your model here in which we are taking a geometric expectation, seems like we are in a bargaining regime that’s downstream of each voter having the ability to torpedo the whole process in favor of some zero point. And I’d conjecture that if power works like this, then thinking through fairness considerations and such we end up with the bargaining approach. I’m interested if you have a take here.
Utility specifications and zero points
I was also a big fan of the full personal utility information being relevant, since it seems that choosing the “right” outcome should take full preferences about tradeoffs into account, not just the ordering of the outcomes. It was also important to the majoritarian model of power that the scheme was invariant to (affine) changes in utility descriptions (since all that matters to it is where the votes come down).
Thinking about what’s happened with the geometric expectation, I’m wondering how I should view the input utilities. Specifically, the geometric expectation is very sensitive to points assigned zero-utility by any part of the voting measure. So we will never see probability 1 assigned to an outcome that has any voting-measure on zero utility (assuming said voting-measure assigns non-zero utility to another option).
We can at least offer say probability on the most preferred options across the voting measure, which ameliorates this.
But then I still have some questions about how I should think about the input utilities, how sensitive the scheme is to those, whether I can imagine it being gameable if voters are making the utility specifications, etc.
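To make the zero-utility sensitivity concrete, here is a small Python sketch. It takes one plain reading of “geometric expectation over the voting measure”, namely exp of the voter-weighted average of log expected utility; this reading, the function names, and the example numbers are assumptions for illustration and may not match the exact criterion in the post.

```python
import math

def geometric_expectation(values, weights):
    """exp(sum_i w_i * log(x_i)); zero if any positively weighted value is zero."""
    if any(v == 0 and w > 0 for v, w in zip(values, weights)):
        return 0.0
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights) if w > 0))

def score(lottery, utilities, voter_weights):
    """Geometric expectation, across voters, of each voter's expected utility.

    utilities[v][o] is voter v's utility for outcome o (hypothetical numbers below).
    """
    expected = [sum(p * u for p, u in zip(lottery, row)) for row in utilities]
    return geometric_expectation(expected, voter_weights)

# Voter 0 (90% of the measure) loves outcome 0; voter 1 (10%) assigns it zero utility.
utilities = [[1.0, 0.5],
             [0.0, 0.5]]
voter_weights = [0.9, 0.1]

pure = score([1.0, 0.0], utilities, voter_weights)   # all probability on outcome 0
mixed = score([0.9, 0.1], utilities, voter_weights)  # a little probability on outcome 1

print(pure, mixed)
```

Under this reading the pure lottery scores exactly zero because one voter’s expected utility is zero, while even a small amount of probability on the other outcome gives a strictly positive score. That is the sense in which the zero point of each voter’s utility specification matters so much.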
Why lottery-lotteries rather than just lotteries
The original sequence justified lottery-lotteries with a (compelling-to-me) example about leadership vs anarchy, in which the maximal lottery cannot encode the necessary negotiating structure to find the decent outcome, but the maximal lottery-lottery could!
This coupled with the full preference-spec being relevant (i.e. taking into account what probabilistic tradeoffs each voter would be interested in) sold me pretty well on lottery-lotteries being the thing.
It seemed important then that there was something different happening on the outer and inner levels of lottery. Specifically when checking for dominance with , we would check . This is doing a majority check on the outside, and compares lotteries via an average (i.e. expected utility) on the inside.
Is there a similar two-level structure going on in this post? It seemed that your updated dominance criterion is taking an outer geometric expectation but then double-sampling through both layers of the lottery-lottery, so I’m unclear that this adds any strength beyond a single-layer “geometric maximal lottery”.
(And I haven’t tried to work through e.g. the anarchy example yet, to check if the two layers are still doing work, but perhaps you have and could illustrate?)
So yeah I was expecting to see something different in the geometric version of the condition that would still look “two-layer”, and perhaps I’m failing to parse it properly. (Or indeed I might be missing something you already wrote later in the post!) In any case I’d appreciate a natural language description of the process of comparing two lottery-lotteries.
Having a go at pointing at “reality-masking” puzzles:
There was the example of discovering how to cue your students into signalling they understand the content. I think this is about engaging with a reality-masking puzzle that might show up as “how can I avoid my students probing at my flaws while teaching” or “how can I have my students recommend me as a good tutor” or etc.
It’s a puzzle in the sense that it’s an aspect of reality you’re grappling with. It’s reality-masking in that the pressure was away from building true/accurate maps.
Having a go at the analogous thing for “disabling part of the epistemic immune system”: the cluster of things we’re calling an “epistemic immune system” is part of reality and in fact important for people’s stability and thinking, but part of the puzzle of “trying to have people be able to think/be agenty/etc” has tended to have us ignore that part of things.
Rather than, say, instinctively trusting that the “immune response” is telling us something important about reality and the person’s way of thinking/grounding, one might be looking to avoid or disable the response. This feels reality-masking; like not engaging with the data that’s there in a way that moves toward greater understanding and grounding.
I think there’s an important thing to note, if it doesn’t already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.
Here’s my attempt at a poor intuitionistic proof:
If you have some kind of program that understands consequences or backchains or etc, then perhaps it’s capable of recognizing that “acquire lots of power” will then let it choose from a much larger set of possibilities. Regardless of the details of how “decisions” are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power. And thus I’m worried about “instrumental convergence”.
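A toy illustration of that intuition, as a rough sketch only: the world graph, state names, and planner below are all hypothetical, and the planner is a plain forward breadth-first search standing in for whatever backchaining a real system might do. The point is just that when one state opens up many more outcomes, plans for most goals end up routing through it.

```python
from collections import deque

# Hypothetical toy world: each key is a state, each value lists the states reachable in one action.
WORLD = {
    "start": ["small_step", "gain_power"],
    "small_step": ["outcome_0"],
    "gain_power": ["outcome_0", "outcome_1", "outcome_2", "outcome_3", "outcome_4"],
}

def plan(world, start, goal):
    """Return a shortest path from start to goal via breadth-first search, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in world.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

goals = [f"outcome_{i}" for i in range(5)]
plans = {goal: plan(WORLD, "start", goal) for goal in goals}

# Goals whose plan passes through the "gain_power" state.
through_power = [goal for goal, path in plans.items() if "gain_power" in path]

print(f"{len(through_power)} of {len(goals)} goals route through gain_power")
```

Nothing here depends on how the goal was chosen, which is the sense in which the pressure toward the power-granting state is “instrumental” rather than tied to any particular objective.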
---
At this point, I’m already much more worried about instrumental convergence, because backchaining feels damn useful. It’s the sort of thing I’d expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining, than a random mind looks like “utility function over here” and “maximizer over there”.
(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can’t literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)
---
Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly-consequentialists). See this post from Paul, and a small discussion on agentfoundations.
I think your pushback is ignoring an important point. One major thing the big contributors have in common is that they tend to be unplugged from the stuff Valentine is naming!
So even if folks mostly don’t become contributors by asking “how can I come more truthfully from myself and not what I’m plugged into”, I think there is an important cluster of mysteries here. Examples of related phenomena:
Why has it worked out that just about everyone who claims to take AGI seriously is also vehement about publishing every secret they discover?
Why do we fear an AI arms race, rather than expect deescalation and joint ventures?
Why does the industry fail to understand the idea of aligned AI, and instead claim that “real” alignment work is adversarial-examples/fairness/performance-fine-tuning?
I think Val’s correct on the point that our people and organizations are plugged into some bad stuff, and that it’s worth examining that.