Jacy Reese Anthis
PhD candidate in sociology and statistics at the University of Chicago. Co-founder of the Sentience Institute. Word cloud: agency, benchmarks, causality, digital minds, generalization, HCI, moral circle expansion, NLP, RLHF, robustness
Public Opinion on AI Safety: AIMS 2023 and 2021 Summary
Thanks for writing this, Nate. This topic is central to our research at Sentience Institute, e.g., “Properly including AIs in the moral circle could improve human-AI relations, reduce human-AI conflict, and reduce the likelihood of human extinction from rogue AI. Moral circle expansion to include the interests of digital minds could facilitate better relations between a nascent AGI and its creators, such that the AGI is more likely to follow instructions and the various optimizers involved in AGI-building are more likely to be aligned with each other. Empirically and theoretically, it seems very challenging to robustly align systems that have an exclusionary relationship such as oppression, abuse, cruelty, or slavery.” From Key Questions for Digital Minds.
AGI goal space is big, but narrowing might not be as hard as it seems.
Natural language alignment
I disagree with Eliezer Yudkowsky on a lot, but one thing I can say for his credibility is that in possible futures where he’s right, nobody will be around to laud his correctness, and in possible futures where he’s wrong, it will arguably be very clear how wrong his views were. Even if he has a big ego (as Lex Fridman suggested), this is a good reason to view his position as sincere and—dare I say it—selfless.
Key Questions for Digital Minds
In particular, I wonder if many people who won’t read through a post about offices and logistics would notice and find compelling a standalone post with Oliver’s 2nd message and Ben’s “broader ecosystem” list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.
I like these examples, but can’t we still view ChatGPT as a simulator—just a simulator of “Spock in a world where ‘The giant that came to tea’ is a real movie” instead of “Spock in a world where ‘The giant that came to tea’ is not a real movie”? You’re already posing that Spock, a fictional character, exists, so it’s not clear to me that one of these worlds is the right one in any privileged sense.
On the other hand, maybe the world with only one fiction is more intuitive to researchers, so the simulators frame does mislead in practice even if it can be rescued. Personally, I think reframing is possible in essentially all cases, which evidences the approach of drawing on frames (next-token predictors, simulators, agents, oracles, genies) selectively as inspirational and explanatory tools, but unpacking them any time we get into substantive analysis.
Yes.
+1. While I will also respect the request to not state them in the comments, I would bet that you could sample 10 ICML/NeurIPS/ICLR/AISTATS authors and learn about >10 well-defined, not entirely overlapping obstacles of this sort.
“We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.”
I don’t want people to skim this post and get the impression that this is a common view in ML.
Interesting! I’m not sure what you’re saying here. Which of those two things is shard theory (written without a subscript)? If the former, then the OP seems accurate. If the latter, or if shard theory without a subscript includes both of those two things, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.
Had you seen the researcher explanation for the March 2022 “AI suggested 40,000 new possible chemical weapons in just six hours” paper? I quote (paywall):
Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.
Risk of misuse
The thought had never previously struck us. We were vaguely aware of security concerns around work with pathogens or toxic chemicals, but that did not relate to us; we primarily operate in a virtual setting. Our work is rooted in building machine learning models for therapeutic and toxic targets to better assist in the design of new molecules for drug discovery. We have spent decades using computers and AI to improve human health—not to degrade it. We were naive in thinking about the potential misuse of our trade, as our aim had always been to avoid molecular features that could interfere with the many different classes of proteins essential to human life. Even our projects on Ebola and neurotoxins, which could have sparked thoughts about the potential negative implications of our machine learning models, had not set our alarm bells ringing.

Our company—Collaborations Pharmaceuticals, Inc.—had recently published computational machine learning models for toxicity prediction in different areas, and, in developing our presentation to the Spiez meeting, we opted to explore how AI could be used to design toxic molecules. It was a thought exercise we had not considered before that ultimately evolved into a computational proof of concept for making biochemical weapons.
Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)
Hm, the meaning of “begging the question” is probably just a verbal dispute, but I don’t think asking questions can in general beg questions, because questions don’t have conclusions. There is no “assuming its conclusion is true” if there is no conclusion. Not a big deal though!
I wouldn’t say values are, at best, independent (i.e., orthogonal); they are often highly correlated, as when the values “have enjoyable experiences” and “satisfy hunger” both lead to eating tasty meals. I agree they are often contradictory, and this is one valid model of problems from catastrophic addiction down to mild inner conflict. I think any rigorous theory of “values” (shard theory or otherwise) will need to make sense of those phenomena, but I don’t see that as an issue for the claim “ensure alignment with its values,” because I don’t think alignment requires complete satisfaction of every value, which is almost always impossible.
Thanks for the comment. I take “beg the question” to mean “assumes its conclusion,” but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned under which brains would fall into either category. For example, insofar as our values are evolutionary in a certain sense (e.g., valuing reproduction), human brains exhibit misaligned mesa-optimization, such as craving sugar. If the sugar craving itself is the value, then arguably we’re well-aligned.
In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate’s recent post:
Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.
The example I like is just getting an AI to fill a container of water, which human brains are able to do, but which, in Fantasia, the sorcerer’s apprentice Mickey Mouse was not able to do! So that’s a basic sense in which brains are aligned, but again I’m not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.
Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight
Very interesting! Zack’s two questions were also the top two questions that came to mind for me. I’m not sure if you got around to writing this up in more detail, John, but I’ll jot down the way I tentatively view this differently. Of course I’ve given this vastly less thought than you have, so many grains of salt.
On “If this is so hard, how do humans and other agents arguably do it so easily all the time?”, how meaningful is the notion of extra parameters if most agents are able to find uses for any parameters, even just through redundancy or error-correction (e.g., in case one base pair changes through exaptation or useless mutation)? In alignment, why assume that all aligned AIs “look like they work”? Why assume that these are binaries? Etc. In general, there seem to be many realistic additions to your model that mitigate this exponential-increase-in-possibilities challenge and seem to more closely fit real-world agents who are successful. I don’t see as many such additions that would make the optimization even more challenging.
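To make the parameter-redundancy point concrete, here is a toy sketch (my own illustration, not anything from John’s model; the function and parameter names are invented for the example): many distinct parameter settings can implement exactly the same input-output behavior, so “extra” parameters need not be wasted and can even absorb perturbations.

```python
def model(x, a, b):
    """A deliberately over-parameterized linear map: y = (a * b) * x."""
    return (a * b) * x

# Three different parameter settings, identical behavior on every input,
# because only the product a * b matters:
settings = [(2.0, 3.0), (6.0, 1.0), (1.5, 4.0)]
outputs = [model(10.0, a, b) for a, b in settings]
print(outputs)  # every setting computes 6 * x

# Error-correction flavor: if `a` is perturbed, `b` can compensate
# without changing the implemented function at all.
a, b = 2.0, 3.0
a_perturbed, b_compensated = a * 0.5, b / 0.5
assert model(7.0, a_perturbed, b_compensated) == model(7.0, a, b)
```

The redundant direction in parameter space (rescaling `a` while inversely rescaling `b`) is one simple way an agent can “find uses” for extra parameters without them being dead weight.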
On generators, why should we carve such a clear and small circle around genes as the generators? Rob mentioned the common thought experiment of alien worlds in which genes produce babies who grow up in isolation from human civilization, and I would push on that further. Even on Earth, we have Stone Age values versus modern values, and if you draw the line more widely (either by calling more things generators or including non-generators), this notion of “generators of human values” starts to seem very narrow and much less meaningful for alignment or a general understanding of agency, which I think most people would say requires learning more values than what is in our genes. I don’t think “feed an AI data” gets around this: AIs already have easy access to genes and to humans of all ages. There is an advantage to telling the AI “these are the genes that matter,” but could it really just take those genes or their mapping onto some value space and raise virtual value-children in a useful way? How do they know they aren’t leaving out the important differentiators between Stone Age and modern values, genetic or otherwise? How would they adjudicate between all the variation in values from all of these sources? How could we map them onto trade-offs suitable for coherence conditions? Etc.
This is great. One question it raises for me is: Why is there a common assumption in AI safety that values are existent (i.e., they exist) and latent (i.e., not directly observable) phenomena? I don’t think those are unreasonable partial definitions of “values,” but they’re far from the only ones, and it’s not at all obvious that they’re the values with which we want to align AI. Philosophers Iason Gabriel (2020) and Patrick Butlin (2021) have pointed out some of the many definitions of “values” that we could use for AI safety.
I understand that just picking an operationalization and sticking to it may be necessary for some technical research, but I worry that the gloss reifies these particular criteria and may even reify semantic issues (e.g., which latent phenomena we want to describe as “values,” a sort of verbal dispute à la Chalmers) incorrectly as substantive issues (e.g., how do we align an AI with the true values?).
This is a very exciting project! I’m particularly glad to see two features: (i) the focus on “deception”, which undergirds much existential risk but has arguably been less of a focal point than “agency”, “optimization”, “inner misalignment”, and other related concepts, (ii) the ability to widen the bottleneck of upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.