mesaoptimizer
Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide an this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.
I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.
So no, I am not pointing at the distinction between ‘implicit/aligned control’ and ‘delegated control’ as terms used in the paper. From the paper:
Delegated control agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.
Well, in the example given above, the agent doesn’t decide for itself what the subject’s desire is: it simply optimizes for its own desire. The work of deciding what is ‘long-term-best for the subject’ does not happen unless that is actually what the goal specifies.
- ↩︎
For certain definitions of “simply”.
- ↩︎
Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.
I am confused: how can this be “information loss” when we are assuming that due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? Because 5 of the data points “didn’t matter”, that shouldn’t count as “information loss” but more like “redundant data, ergo no information transmitted”.
Control methods are always implemented as a feedback loop.
Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).
They are also not allowed to tell each other their true goals, and are ordered to eliminate the other if they tell them their goals. Importantly these rules also happen to allow them to have arbitrary sub goals as long as they are not a threat to humanity.
If we can steer an AI to an extent where they will follow such an arbitrary rule that we provide them, we can fully align AIs too with the tools we use to make it do such a thing.
Therefore An can properly align A_{n+1} . The base case is simply a reasonable human being who is by definition aligned. Therefore A_n can be aligned for all n.
They key word that confuses me here seems to be “align”. How exactly does properly align ? How does a human being align a GPT-2 model, for example? What does “align” even mean here?
My bad. I’m glad to hear you do have an inside view of the alignment problem.
If knowing enough about ML is your bottleneck, perhaps that’s something you can directly focus on? I don’t expect it to be hard for you—perhaps only about six months—to get to a point where you have coherent inside models about timelines.
Part of the reason I’m considering getting a degree is so I can get a job if I want and not have to bet on living rent-free with other rationalists or something.
Yeah, that’s a hard problem. You seem smart: have you considered finding rationalists or rationalist-adjacent people who want to hire you part-time? I expect that the EA community in particular may have people willing to do so and that would give you both experience (to show future employers / clients), connections (to find more part-time / full-time jobs), and money.
Now that I think about it though, I probably overestimated how long the timelines of optimistic alignment researchers were so it’s probably more like 2040.
You just updated towards shortening your timelines by a decade due to what would be between 5 minutes to half an hour of tree-of-thought style reflection. Your reasoning seems entirely social (that is, dependent on other people’s signalled beliefs) too, which is not something I would recommend if you want to do useful alignment research.
The problem with relying on social evidence for your beliefs about scientific problems is that you both end up with bad epistemics and end up taking negative expected value actions. First: if other people update their beliefs due to social evidence the same way you do, you are vulnerable to a cascade of belief changes (mundane examples: tulip craze, cryptocurrency hype, NFT hype, cult beliefs) in your social community. This is even worse for the alignment problem because of the significant amount of disagreement in the alignment research community itself about details about the problem. Relying on social reasoning in such an epistemic environment will leave you constantly uncertain due to how uncertain you percieve the community is about core parts of the problem. Next: if you do not have inside models of the alignment problem, you shall fail to update accurately given evidence about the difficulty about the problem. Even if you rely on other researchers who have inside / object-level models and update accurately, there is bound to be disagreement between them. Who do you decide to believe?
The first thing I recommend you do is to figure out your beliefs and model of the alignment problem using reasoning at the object-level, without relying on what anyone else thinks about the problem.
2050? That’s quite far off, and it makes sense that you are considering university given you expect to have about two decades.
Given such a scenario, I would recommend trying to do a computer science/math major, specifically focusing on the subjects listed in John Wentworth’s Study Guide that you find interesting. I expect that three years of such optimized undergrad-level study will easily make someone at least SERI MATS scholar level (assuming they start out a high school student). Since you are interested in agent foundations, I expect you shall find John Wentworth’s recommendations more useful since his work seems close to (but not quite) agent foundations.
Given your timelines, I expect doing an undergrad (that is, a bachelor’s degree) would also give you traditional credentials, which are useful to survive in case you need a job to fund yourself.
Honestly, I recommend you simply dive right in if possible. One neglected but extremely useful resource I’ve found is Stampy. The AGI Safety Fundamentals technical course won’t happen until September, it seems, but perhaps you can register your interest for it. You can begin reading the curriculum—at least the stuff you aren’t yet familiar with—almost immediately. Dive deep into the stuff that interests you.
Well, I assume you have already done this, or something close to this, and if that is the case, you can ignore the previous paragraph. If possible, could you go into some detail as to why you expect we will get a superintelligence at around 2050? It seems awfully far to me, and I’m curious as to the reasoning behind your belief.
Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
Interpretability didn’t make the list because of the following beliefs of mine:
Interpretability—specifically interpretability-after-training—seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, but I doubt this strategy scales until ASI. I expect it to break almost immediately as someone begins a human-in-the-loop RSI, especially since I expect (at the very least) significant changes in the architecture of neural network models that would result in capability improvements. This is why I predict that investing in interpretability research is not the best idea.
A counterpoint is the notion that we can accelerate alignment with sufficiently capable aligned ‘oracle’ models—and this seems to be OpenAI’s current strategy: build ‘oracle’ models that are aligned enough to accelerate alignment research, and use better alignment techniques on the more capable models. Since one can both accelerate capabilities research and alignment research with capable enough oracle models, however, OpenAI would also choose to accelerate capabilities research alongside their attempt to accelerate alignment research. The question then is whether OpenAI is cautious enough as they balance out the two—and recent events have not made me optimistic about this being the case.
Interpretability research does help accelerate some of the alignment agendas I have listed by providing insights that may be broad enough to help; but I expect that such insights to probably be found through other approaches too, and the fact that interpretability research either involves not working on more robust alignment plans or leads to capability insights, both seem to make me averse to considering working on interpretability research.
Here’s a few facets of interpretability research that I am enthusiastic about tracking, but not excited enough to want to work on, as of writing:
Interpretability-during-training probably would be really useful, and I am more optimistic about it than interpretability-after-training. I expect that at the limit, interpretability-during-training leads to progress towards ensuring ontological robustness of values.
Interpretability (both after-training and during-training) will help with detecting and making interventions when it comes to inner misalignment. That’s a great benefit, that I haven’t really thought about until I decided to reflect and answer your question.
Interpretability research seems very focused on ‘oracles’—sequence modellers and supervised learning systems—and interpretability research on RL models seems neglected. I would like to see more research done on such models, because RL-style systems seems more likely to lead to RSI and ASI, and insights we gain might help alignment research in general.
I’m really glad you asked me this question! You’ve helped me elicit (and develop) a more nuanced view on interpretability research.
There seem to be three key factors that would influence your decision:
Your belief about how valuable the problem is to work on
Your belief about how hard it is to solve this problem and how well the current alignment community is doing to solve the problem
Your belief about how long we have until we run out of time
Based on your LW comment history, you probably already have rough models about the alignment problem that inform these three beliefs of yours. I think it would be helpful if you could go into detail about them so people can give you more specific advice, or perhaps help you answer another question further upstream of the one you asked.
Causal Influence Diagrams are interesting, but don’t really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote are structured causal models so you don’t read this paper for object level usefulness but incidental research contributions that are really interesting.
The paper divides AI systems into two major frameworks:
MDP-based frameworks (aka RL-based systems such as AlphaZero), which involve AI systems that take actions and are assigned a reward for their actions
Question-answering systems (which includes all supervised learning systems, including sequence modellers like GPT), were the system gives an output given an input and is scored based on a label of the same data type as the output. This is also informally known as tool AI (they cite Gwern’s post, which is nice to see).
I liked how lucidly they defined wireheading:
In the basic MDP from Figure 1, the reward parameter ΘR is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.
The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their head (or perhaps have no formalization and therefore be confused), and having this ‘more formal’ definition in my head seems rather useful.
Here’s their distillation for Current RF-optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it—models that avoid wireheading by modelling effects of resulting changes to policy and then deciding what trajectory of actions to take):
An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
Here’s their distillation of Reward Modelling:
A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].
The resulting CI diagram modelling actually made me feel like I grokked Reward Modelling better.
Here’s their distillation of CIRL:
Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.
The difference between RM and CIRL causal influence diagrams is interesting, because there is a subtle difference. The authors imply that this minor difference matters and can imply different things about system incentives and therefore safety guarantees, and I am enthusiastic about such strategies for investigating safety guarantees.
The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:
The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].
They propose a solution to the self-fulfilling prophecies problem, via making oracles optimize for reward in the counterfactual world where their answer doesn’t influence the world state and therefore the label which they are optimized for. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far reaching consequences in the world.
It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
The authors also anticipate this problem but instead of considering whether and how one can tractably calculate counterfactual labels, they connect this intractability to introducting the debate AI safety strategy:
To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.
I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.
They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I’ve read and the informal conversations I’ve had about IDA:
Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems Xk are asked to answer a set of simpler questions Qi. By combining the answers Ai to the simpler questions Qi, the user can guess the answer ˆA to Q. A more powerful system Xk+1 is trained to answer Q, with ˆA used as an approximation of the correct answer to Q.
Once the more powerful system Xk+1 has been trained, the process can be repeated. Now an even more powerful QA-system Xk+2 can be trained, by using Xk+1 to answer simpler questions to provide approximate answers for training Xk+2. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
I have no idea why they included Drexler’s CAIS—but it is better than reading 300 pages of the original paper:
Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.
The authors claim that the AI safety issues commonly discussed can be derived ‘downstream’ of modelling these systems more formally, using these causal influence diagrams. I disagree, due to the amount of degrees of freedom the modeller is given when making these diagrams.
In the discussion section, the authors talk about the assumptions underlying the representations, and their limitations. They explicitly point out how the intensional stance may be limiting and not model certain classes of AI systems or agents (hint: read their newer papers!)
Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in them. I’m excited to read papers by this group.
When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines[1]. Perhaps I should start using a more precise term to describe this from now on.
It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three letter agencies) to initiate a moratorium, much less consensus in the US government or between US and EU (especially given the rather interesting strategy EU is trying with their AI Act).
I continue to consider all superhuman-enforcement-tool-based pivotal acts as unilateral given this belief. My use of the world “unilateral” points to the fact that the organizations and people who currently have a non-trivial influence over the state of the world and its future will almost entirely be blindsided by the pivotal act, and that will result in destruction of trust and chaos and an increase in conflict. And I currently believe that this is actually more likely to increase P(doom) or existential risk for humanity, even if it extends the foom timeline.
- ↩︎
Although not preventing ASI creation entirely. The destruction of humanity’s potential is also an existential risk, and the inability for us to create a utopia is too painful to bear.
- ↩︎
Your question seems to focus mainly on timeline model and not alignment model, so I shall focus on explaining how my model of the timeline has changed.
My timeline shortened from about four years (mean probability) to my current timeline of about 2.5 years (mean probability) since the GPT-4 release. This was because of two reasons:
gut-level update on GPT-4′s capability increases: we seem quite close to human-in-the-loop RSI.
a more accurate model for bounds on RSI. I had previously thought that RSI would be more difficult than I think it is now.
The latter is more load-bearing than the former, although my predictions for how soon AI labs will achieve human-in-the-loop RSI creates an upper bound on how much time we have (assuming no slowdown), which is quite useful when making your timeline.
My current theory of change to mitigate existential risk by misaligned ASI
Formatting error: “OK, I used to work for a robotics company, and I do think that one of the key obstacles for a hostile AI is moving atoms around. So let me propose some alarms!” should be quoted since it is not you (Zvi) writing that passage but the person you linked and are quoting.
Possible typos:
“I kind of feel like if you are the one building the DoNotPlay chat, [...]” should be “DoNotPay” instead.
“Joshua gets ten out of ten for the central point, then (as I score it) gets either minus a million for asking the wrong questions.” the “either” is not followed by two objects
Spent about 45 minutes processing this mentally. Did not look through the code or wonder about the reliability of the results. Here are my thoughts:
Why ask an AI to shut down if it recognizes its superiority? If it cannot become powerful enough for humans to handle, it cannot become powerful enough to protect humans from another AI that is too powerful for humans to handle.
Based on what I can tell, AP fine-tuning will lead to the AI more likely simulating the relevant AP and its tokens will be what the simulator thinks the AP would return next. This means it is brittle to systems that leverage this model since they can simply beam search and ignore the shutdown beams. RLHF-like fine-tuning strategies probably perform better, according to my intuition.
-
How successful is this strategy given increasing scale of LLMs and its capabilities? If this was performed on multiple scales of GPT-2 , it would provide useful empirical data about robustness to scale. My current prediction is that this is not robust to scale given that you are fine-tuning on stories to create personas. The smarter the model is, the more likely it is to realize when it is being tested to provide the appropriate “shutdown!” output and pretend to be the AP, and in out-of-distribution scenarios, it will pretend to be some other persona instead.
-
The AP finetuned model seems vulnerable to LM gaslighting the same way ChatGPT is. This does not seem to be an improvement over OAI’s Instruct fine-tuning or whatever they did to GPT-4.
I apologize for not interacting with certain subsets of your post that you may consider relevant or significant as a contribution. That is mainly because I think their significance is downstream of certain assumptions you and I disagree about.
I agree, and believe it would have been useful if Jacob (post author) had made this clear in the opening paragraph of the post. I see no point in reading the post if it does not measurably impact my foom/doom timeline probability distribution.
I am interested in his doom scenario, however.
I want to differentiate between categories of capabilities improvement in AI systems, and here’s the set of terms I’ve come up with to think about them:
-
Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (Pytorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, the physical environment where the infrastructure is situated. This probably is not the lowest hanging fruit when it comes to capabilities acceleration.
-
Scaffolding improvements: Capability boost in an AI system that involves augmenting the AI system via software features. Think of it as keeping the CPU of the natural language computer the same, but upgrading its RAM and SSD and IO devices. Some examples off the top of my head: hyperparameter optimization for generating text, use of plugins, embeddings for memory. More information is in beren’s essay linked in this paragraph.
-
Neural network improvements: Any capability boost in an AI system that specifically involves improving the black-box neural network that drives the system. This is mainly what SOTA ML researchers focus on, and is what has driven the AI hype over the past decade. This can involve architectural improvements, training improvements, finetuning afterwards (RLHF to me counts as capabilities acceleration via neural network improvements), etc.
There probably are more categories, or finer ways to slice the space of capability acceleration mechanisms, but I haven’t thought about this in as much detail yet.
As far as I can tell, both capabilities augmentation and capabilities acceleration contribute to achieving recursive self-improving (RSI) systems, and once you hit that point, foom is inevitable.
-
Your text here is missing content found in the linked post. Specifically, the sentence “If one has to do this with” ends abruptly, unfinished.
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don’t expect the ‘control problem’ to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.