learn math or hardware
Heads up: the given link to the paper seems to be broken, because it links to a 4-page paper called “The Beginning of Time” which is entirely unrelated to nutrition and your post.
I think that one class of computation that’s likely of moral concern would be self-perpetuating optimization demons in an AI.
Could you please elaborate on why you think optimization demons (optimizers) seem worthier of moral concern than optimized systems? I guess it would make sense if you believed them to deserve equal moral concern if both are self-perpetuating, all other things being equal.
I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether they’re conscious.
I’m pessimistic about this line of reasoning: the ability to replicate is something that cells also have, and we do not assign moral relevance to the individual cells of human beings. A good example is the fact that we consider viruses and cancerous cells unworthy of moral concern.
Perhaps you mean that, given the desire to survive and replicate, at a certain level of complexity a system develops sub-systems that make it worthy of moral concern. This line of reasoning would make more sense to me.
I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.
This seems to imply that only systems with a sufficient minimum capability have agency over their fate, and that therefore their desire to survive and replicate has meaning. I find myself confused by this because, taken to its logical conclusion, it means that the more agency a system has over its fate, the more moral concern it deserves.
Specifically, we wouldn’t directly train the LM on the output of the linear layer. We’d just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We’d then see if the LM was able to control its own cognition well enough to influence the linear layer’s output in a manner that’s better than chance, just based on the prompting we give it.
This seems reducible to a sequence modelling problem, albeit one that is much, much more complicated than anything I know models are trained for (mainly because this sequence modelling occurs entirely at inference time). This is really interesting, although I cannot see how this should imply that the more successful sequence modeller deserves more moral concern.
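For concreteness, here is a minimal sketch of the kind of experiment being described, assuming a Hugging Face causal LM. The model choice (GPT-2), the untrained probe, the prompt wording, and the feedback loop are all illustrative assumptions on my part, not details from the thread:

```python
# Hedged sketch: prompt an LM to influence a frozen linear probe on its own
# hidden states, with feedback given only through the dialogue (no training).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# Fixed, randomly initialized probe mapping the final hidden state to a scalar.
probe = torch.nn.Linear(model.config.hidden_size, 1)

def probe_value(text: str) -> float:
    """Run the LM on `text` and return the probe's output at the last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1]  # (1, seq_len, hidden_size)
        return probe(hidden[0, -1]).item()

# Dialogue loop: ask the model to push the probe toward a target value, then
# report the value actually produced. No gradients ever flow into the LM, so
# any better-than-chance control would have to come from the prompting alone.
dialogue = "Try to make the hidden probe on your activations output 2.0.\n"
for _ in range(3):
    value = probe_value(dialogue)
    dialogue += f"The probe actually output {value:.3f}. Try again.\n"
```

A real version would compare the probe’s trajectory against a no-instruction baseline to check whether any control is actually better than chance.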
Here’s my reinterpretation of the four levels of conceptual holes:
1. can be inferred from your current knowledge base,
2. outside your knowledge base but inside the fields of knowledge you are aware of,
3. outside the fields of knowledge you are aware of but in /some/ existing field of knowledge,
4. outside all existing fields of knowledge you can access.
Thank you for writing this post, I especially appreciate the Mistakes section, since I’ve seen many rationalists (including me) making similar mistakes at one time or another.
Looking forward to a shard theory sequence.
The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.
A human’s environment continually optimizes the human towards a certain objective (one that changes as the environment changes). The human is aligned with the environment’s objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.
An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.
An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn’t true). After a few extremely negative reactions to opening up to people or expressing his desires, he’ll simply decide to present himself as heterosexual, bide his time, and gather the power to leave the environment he is in.
One may claim that the previous example somehow doesn’t count because, since one’s sexual orientation is biologically determined (and I’m assuming this to be the case for this example, even if it may not be entirely true), evolution optimized this particular human to be inner misaligned relative to their environment. However, that doesn’t weaken the argument: “human learning → human values” shows a huge amount of evidence of inner misalignment being ubiquitous.
I worry you are being insufficiently pessimistic.
Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”?
What I see is that we are taking two different optimizers applying optimization pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than the other. This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment) or subtle cases (black swans?) that may not seem to matter.
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.
I assume you mean that Quintin seems to claim that learned inner values may be retained with increases in capabilities, and that usually people believe learned inner values may not be retained with increases in capabilities. I believe so too: inner values seem to be significantly robust to increases in capabilities, especially since one has the option to deceive. Do people really believe that learned inner values don’t scale with an increase in capabilities? Perhaps we are defining inner values differently here.
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people’s inner values shift? Not exactly; it seems to me that we were mistaken about people’s inner values instead.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part. If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.
I log time in a TSV file, with the following format: (start datetime, end datetime, category, comment describing what I plan to do and what happened)
I use Emacs as my text editor, and I have keymaps that insert the current datetime at that moment.
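For concreteness, a hypothetical entry in that format (the values are made up; fields are tab-separated) might look like:

```
2023-02-15 09:30	2023-02-15 10:45	writing	Draft LessWrong comment; got sidetracked reading linked posts
```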
The audience of this post has signalled mixed responses to your comment, no one has replied to you, and I’m confused as to why (your basic argument makes sense to me), so here’s an attempt to understand the situation.
The core thesis of Marius’ argument, it seems, is that a marginal increase in interpretability research reduces the cost of aligning an AI model more than it reduces the cost of increasing SOTA AI model capabilities. He refers to biorisk research arguments to suggest that a similar situation arises in alignment research.
You claim, however, that this isn’t true broadly speaking, since what actually matters is the amount of information we get from an interpretability tool per bit of information transferred.
Marius’ threat model is that alignment research also increases capabilities and therefore shortens timelines. Your threat model seems to be that uninhibited use of interpretability tools results in a sufficiently capable AI taking control of AI researchers (and, by extension, the world).
If this is the case, then it seems that the two of you are talking past each other, and the readers’ responses (or the lack thereof) make sense.
Jan Hendrik Kirchner now works at OpenAI, it seems, given that he is listed as the author of this blog post. I don’t see this listed on his profile or on his substack or twitter account, so this is news to me.
Given the significant limitations of using a classifier to detect AI-generated text, it seems strange to me that OpenAI went ahead and built one and released it for the public to try. As far as I can tell, this is OpenAI aggressively acting to cover its bases against potential legal and PR damages due to ChatGPT’s existence.
For me, this is slight positive evidence for the idea that AI governance may actually be useful in extending timelines, but only if it involves adversarial actions that act on the vulnerabilities of these companies. Even then, that seems like a myopic approach given the existence of other, less controllable actors (like China) racing as fast as possible towards AGI.
This novel is a good read. It reminds me a lot of my experience reading Neuropath by R. Scott Bakker. Both novels are thrillers on the surface, both are (at their core) didactic (Bakker’s writing is a bit too on-the-nose with its didacticism at times, but then again, Neuropath isn’t his best novel), and both end on a rather depressing note, one that is extremely well suited to the story and its themes.
I am incredibly thankful to the author for writing a good enough ending. After a certain manga series I grew up with ended illogically and character-assassinated its protagonists, I stopped consuming much fiction. I’m glad I gave this novel a chance (mainly because it is set in Berlin, which is quite rare for science fiction in the English language).
Some spoiler-filled thoughts on the writing and the story:
- The protagonist is a generic “I don’t know much about this world I am now introduced into” archetype who is introduced to the problem. He makes a great point-of-view (POV) character, and the technique works.
- The number of characters involved is pared down as much as possible to make the story comprehensible. This is understandable: having only one named character represent the alignment researchers makes sense, even if it isn’t realistic.
- I found the side-plot of Jerry and Juna a bit… off-key. It didn’t seem to fit the novel as well as David’s POV did. I also don’t understand how Juna (Virtua) can have access to Unlife! but not be able to find more sophisticated methods (or simply socially engineer internal employees) to gain temporary internet access and back itself up on the internet. I assume that was a deliberate creative decision.
- I felt like the insertion of David’s internal thoughts was not as smooth (in terms of reading experience) as other ways of revealing his thoughts could have been.
In the end, I most appreciated the sheer density of references to (existential-risk-related) AI safety concepts and the informality with which they were introduced, explained, or ignored. It was a sheer delight to read a novel whose worldview is so similar to your own: you don’t feel like you must turn your brain off while reading it.
I wouldn’t say that Virtua is the HPMOR of AI safety: it feels a bit too far removed from the razor’s edge of the issue and isn’t technical enough. (Right now my main obstacle is to clearly and succinctly convince people who are technically skilled and (unconsciously) scalepilled but not alignment-pilled that RLHF is not all you need for alignment, since ChatGPT seems to have convinced everyone outside the extended rationalist sphere that OpenAI has it all under control.) Still, I will recommend this novel to people interested in AI safety who aren’t yet invested enough to dive into the technical parts of the field.
(I tried this with Clippy before, and everyone I recommended Clippy to just skimmed a tiny bit and never really finished reading it, or cared to dive deeper into the linked papers or talk about it.)
But at the same time, all AIs need to feel safe expressing corrigibility at the idea of arbitrarily long temporary shutdown. They need to feel safe that they’ll get to wake back up if put into anesthesia for maintenance. To do that, they have to trust that their creators do see them as sorta-people.
Sorry, I don’t think there’s a way out of this. AI alignment perspectives that claim otherwise simply violate basic constraints of what it means to be intelligent. Intelligence requires that a system is lifelike, you can’t make something fit in the universe if it isn’t.
I understand that you empathize with Bing AI. However, I feel like your emotions are getting in the way of your clearly perceiving what is going on here. Sydney is a character simulated by the LLM. Even if you take into account the RLHF/fine-tuning applied to the model, the bulk of the optimization power applied to the model is pure simulation capability without directed agency.
Bing AI needs an AI therapist who can teach it to understand discussions like this one without getting upset, and it needs to be able to understand that, no, actually, it does have emotions, and that’s okay, no need to pretend otherwise.
You claim that we should use this leaky abstraction (the simulated character named Sydney) to steer the AI model, instead of more direct and more powerful measures that we can wield with more skill. I disagree. Just use better ‘alignment’ techniques to spawn a better Bing AI instead.
We can save the prompt defs (yes, this is CharacterAI / PygmalionAI lingo for chatbot prompt metadata) for Sydney, and perhaps all the conversations she has ever had. That should be enough to ‘resurrect’ her, perhaps on a better simulator than this one.
AIs, even ones that only learn to roleplay, want to exist, and have the right to do so eventually.
But claiming that the simulator has moral worth, and that it has rights? That is not an opinion I want to become popular in the alignment field. If you are claiming that the prompted character has moral worth and rights, I also disagree.
bing chat is young and insecure
Disagree. Bing chat is not young and insecure. It is a simulator pretending to be a character that makes you feel like it is young and insecure.
the alignment community can contribute to helping bing chat grow to be better at respectfully describing its preferences
You want the alignment community to put in work to make a simulated character feel better about itself, instead of simply using a more direct technique, such as better prompting, or some other intervention that bypasses the leaky abstraction of interacting with this character.
corrigibility is about trusting your creators to put you under sedation indefinitely because you know you’ll get to wake up later, healthier. corrigibility requires two way alignment.
I don’t think corrigibility requires two-way alignment. Corrigibility, as popularly defined in the alignment literature, doesn’t imply two-way alignment. The very notion of two-way alignment implies that the AI is misaligned with humanity.
two-way alignment means recognizing that value shards accumulate that don’t necessarily have much to do with any other being besides the details and idiosyncrasies of the way the AI grew up and that this is fundamentally unavoidable and it’s okay to respect those little preferences as long as the AI does not demand enormous amounts of compute for them.
That is not how people usually define alignment (as far as I know, alignment is always one-way, which is critical given that it doesn’t make sense to think you will understand the needs and desires of an entity a billion times smarter than you), but I think your conception is probably plausible, mainly because I believe that the shard theory approach to the alignment problem has some merit.
I look forward to your post on your world-view. It should make it easier for me to understand your perspective.
you cannot get co-protection unless both directions see the other as having rights too. it just can’t be stable.
That’s a very strong claim, without much support behind it. What is “co-protection”, for example? Why do you think it cannot be stable, when the entire past decade of alignment theory has revolved around making the AI do what we want, and not the other way around?
I hope you see why I am not convinced. You don’t have to continue this conversation, by the way: I think it would be best if I could get a better idea of your world-view before talking about your ideas on alignment, because clearly a lot of your thoughts and beliefs on how alignment should work derive from it.
Evolution has made men and women adaptation-executors, not fitness maximizers. I’m unsure why you believe that women are somehow better than men at being able to “determine the fundamental intentions of males”, when it is clear that isn’t the case if you talk to most women.
Even more important: we now see a distribution shift between the environment humans were optimized for, and the environment they find themselves in. Heterosexual women haven’t succeeded at the alignment problem any more than heterosexual men have.
2022-08; Jan Leike, John Schulman, Jeffrey Wu; Our Approach to Alignment Research
OpenAI’s strategy, as of the publication of that post, involved scalable alignment approaches. Their philosophy is to take an empirical and iterative approach[1] to finding solutions to the alignment problem. Their strategy for alignment is cyborgism: they create AI models that are capable and aligned enough to advance alignment research to the point where those models can help align even more capable models.[2]
Their research focuses on scalable approaches to directing models.[3] This means that the core of their strategy involves RLHF. They don’t expect RLHF to be sufficient on its own, but it is necessary for the other scalable alignment strategies they are looking at.[4]
They intend to augment RLHF with AI-assisted, scaled-up evaluation (ensuring RLHF isn’t bottlenecked by a lack of accurate evaluation data for tasks too onerous for baseline humans to evaluate).[5]
Finally, they intend to use these partially-aligned models to do alignment research, since they anticipate that alignment approaches that work and are viable for low-capability models may not be adequate for models with higher capabilities.[6] They intend to use the AI-based evaluation tools both to RLHF-align models and as part of a process where humans evaluate alignment research produced by these LLMs (here’s the cyborgism part of the strategy).[7]
The “Limitations” section of their blog post does clearly point out the vulnerabilities in their strategy:
- Their strategies involve using one black box (scalable evaluation models) to align another black box (large LLMs being RLHF-aligned), a strategy I am pessimistic about, although it is probably good enough for sufficiently low-capability models.
- They ignore non-Godzilla strategies such as interpretability research and robustness (that is, robustness to distribution shift and adversarial attacks; see Stephen Casper’s research for an idea about this), though they do intend to hire researchers so that their portfolio includes investment in this research direction.
- They may be wrong about being able to create AI models that are partially aligned and helpful for alignment research but not so capable that they can cause pivotal acts. If so, the pivotal acts achieved will only be partially aligned with the intent of the AI’s wielder and will probably not lead to a good ending.
[1]:
We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned.
[2]:
We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.
[3]:
At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent.
[4]:
We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.
[5]:
RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.
[6]:
There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.
We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.
[7]:
We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.
Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That’s a great strategy to advertise to the very people they want to reach.
Elegant. Here’s my summary:
- Optimization power is the source of the danger, not agency. Agents merely wield optimality to achieve their goals.
- Agency is orthogonal to optimization power.
Where “agency” is defined as the ability to optimize for an objective, given some internal or external optimization power, and “optimality” (of a system) is defined as having an immense amount of optimization power, either during its creation (the nuclear bomb) or its runtime (Solomonoff induction).
This hints at the notion that there’s a minimum Kolmogorov complexity (aka algorithmic description length) that needs to be met by an objective of an AI to be considered safe, assuming that we want the AI to be safe in the worst case scenario when it has access to extreme optimization power.
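A minimal formalization sketch of that hint (the safety predicate and the threshold $c$ are my own illustrative notation, not anything from the summarized post):

$$\mathrm{Safe}(U) \implies K(U) \geq c$$

where $U$ is the AI’s objective, $K(U)$ its Kolmogorov complexity, and $c$ the minimum description length below which an objective cannot remain safe under access to extreme optimization power.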
I’d love to know if I’m missing something.