Steven Byrnes
I’m an AGI safety researcher in Boston with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285
I might have a narrowly activated value against taking COVID tests…
Hmm, I think I’d say “I might feel an aversion…” there.
“Desires & aversions” would work in a context where the sign was ambiguous. So would “preferences”.
Things I agree with:
Model-based RL algorithms tend to create agents with a big mess of desires and aversions that are not necessarily self-consistent. (But see caveat about self-modification below.)
When an agent is in an environment wherein its desires can change (e.g. an RL agent being updated online by TD learning), it will tend to take foresighted actions to preserve its current desires (cf. instrumental convergence), assuming the agent is sufficiently self-aware and foresighted.
Human within-lifetime learning is an example of model-based RL, and well worth looking into in the course of trying to understand more generally how powerful model-based RL agents might behave.
Things I disagree with:
I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”) which (AFAICT) would be a better match to how the term is being used here.
I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing. This is analogous to how you or I might try to self-modify to erase some of our (unendorsed) desires (to eat junk food, be selfish, be cruel, etc.), if we could. (This doesn’t even require exotic things like access to source code / brain tissue; there are lots of mundane tricks to kick bad habits etc.) (Cf. also how people will “bite the bullet” in thought experiments when doing moral reasoning.)
Counterpoint: What if we try to ensure that the AGI has a strong (meta-)preference not to do that?
Response to counterpoint: Sure! That seems like a promising thing to look into. But until we have a plan to ensure that that actually happens, I’ll keep feeling like this is an “area that warrants further research” not “cause for general optimism”.
Relatedly and more broadly, my attitude to the technical alignment problem is more like “It is as yet unclear whether or not we face certain doom if we train a model-based RL agent to AGI using the best practices that AI alignment researchers currently know about” (see §14.6 here), not “We have strong reason to believe that the problem is solvable this way with no big new alignment ideas”. This makes me “optimistic” about solvability of technical alignment by the standards of, say, Eliezer, but not “optimistic” as the term is normally used. More like “uncertain”.
I guess I’m not sure if this is really a disagreement with the post. The post seems to have optimistic vibes in certain places (“if shard theory is true, meaningful partial alignment successes are possible”) but the conclusion is more cautious. ¯\_(ツ)_/¯
I think some of the detailed descriptions of (anthropomorphized) shards are misleading, but I’m not sure that really matters for anything.
We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement.
I would unify those by saying: We get to write a reward function however we want. The reward function can depend on external inputs, if that’s what we want. One of those external inputs can be a “reward” button, if that’s what we want. And the reward function can be a single trivial line of code
return (100 if reward_button_is_being_pressed else 0)
, if that’s what we want. Therefore, “having human overseers give out [reward]” is subsumed by “handwriting a simple [reward function]”.
(I don’t think you would wind up with an AGI at all with that exact reward function, and if you did, I think the AGI would probably kill everyone. But that’s a different issue.)
(Further discussion in §8.4 here.)
(By the way, I’m assuming “reinforcement” is synonymous with “reward”; let me know if it isn’t.)
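To make the unification above concrete, here’s a minimal sketch (my illustration, not code from the post; the observation key is hypothetical) of a hand-written reward function where the “reward button” is just one more external input:

```python
# Minimal sketch: the reward function is hand-written code, and a "reward
# button" is just one of the external inputs it may consult.

def reward_function(observation: dict) -> float:
    """Hand-written reward function over whatever external inputs we wired in."""
    return 100.0 if observation.get("reward_button_is_being_pressed") else 0.0

# A human overseer "giving out reward" is then just a change in this input:
print(reward_function({"reward_button_is_being_pressed": True}))   # 100.0
print(reward_function({"reward_button_is_being_pressed": False}))  # 0.0
```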
The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
It also works in the scenario where human programmers develop a general-purpose (i.e. retargetable) internal search process, i.e. brain-like AGI or pretty much any other flavor of model-based RL. You would look for things in the world-model and manually set their “value” (in RL jargon) / “valence” (in psych jargon) to very high or low, or neutral, as the case may be. I’m all for that, and indeed I bring it up with some regularity. My progress towards a plan along those lines (such as it is) is mostly here. (Maybe it doesn’t look that way, but note that “Thought Assessors” ( ≈ multi-dimensional value function) can be thought of as a specific simple approach to interpretability, see discussion of the motivation-vs-interpretability duality in §9.6 here.) Some of the open problems IMO include:
figuring out what exactly value to paint onto exactly what concepts;
dealing with concept extrapolation when concepts hit edge-cases [concepts can hit edge-cases both because the AGI keeps learning new things, and because the AGI may consider the possibility of executing innovative plans that would take things out of distribution];
getting safely through the period where the “infant AGI” hasn’t yet learned the concepts which we want it to pursue (maybe solvable with a good sandbox);
getting the interpretability itself to work well (including the first-person problem, i.e. the issue that the AGI’s own intentions may be especially hard to get at with interpretability tools because it’s not just a matter of showing certain sensory input data and seeing what neurons activate).
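To illustrate the “look for things in the world-model and manually set their valence” idea from the paragraph above that list, here’s a deliberately oversimplified toy sketch (my own; all names are hypothetical, and each step glosses over one of those open problems):

```python
# Toy sketch of "retargeting" a model-based RL agent by painting valence onto
# concepts found in its learned world-model. All names are hypothetical.

# Step 0 (interpretability): locate human-legible concepts in the world-model.
world_model_concepts = {
    "human_flourishing": {"valence": 0.0},
    "reward_channel_tampering": {"valence": 0.0},
}

def set_valence(concept: str, valence: float) -> None:
    """Manually override the value (RL jargon) / valence (psych jargon) of a concept."""
    world_model_concepts[concept]["valence"] = valence

# Step 1 (what value on what concept?): paint the values we want.
set_valence("human_flourishing", +10.0)
set_valence("reward_channel_tampering", -10.0)

# Step 2 (planning): the agent scores candidate plans by the valence of the
# concepts each plan activates. Concept extrapolation at edge cases is where
# this toy picture breaks down.
def plan_score(activated_concepts: list) -> float:
    return sum(world_model_concepts[c]["valence"] for c in activated_concepts)

print(plan_score(["human_flourishing"]))         # +10.0
print(plan_score(["reward_channel_tampering"]))  # -10.0
```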
There are some humans who take the Simulation Hypothesis seriously, and care about what’s happening in (the presumed) basement reality. They generally don’t care much, and I’ve never heard of someone changing their life plans on that basis, but some people care a little, apparently. We can ponder why, and when we figure it out, we can transfer that understanding to thinking about AIs.
I agree with “unlikely”, but disagree on general principle that we shouldn’t spend any time thinking through unlikely optimistic scenarios. It’s not the highest-priority thing, but that’s different than saying it’s not worth doing at all.
Oops! I started to add that in a new bullet point, but then decided to put it in the figure caption instead, and then I forgot to go back and delete the bullet point. Thank you, fixed now.
I argue in Section 1.6 here that an AGI with similar capabilities as an ambitious intelligent charismatic methodical human, and with a radically nonhuman motivation system, could very plausibly kill everyone, even leaving aside self-improvement and self-replication.
(The category of “ambitious intelligent charismatic methodical humans who are explicitly, patiently, trying to wipe out all of humanity” is either literally empty or close to it, so it’s not like we have historical data to reassure us here. And the danger of such people increases each year, thanks to advancing technology e.g. in biotech.)
Autism is underdiagnosed, not overdiagnosed
Those aren’t mutually exclusive—see here for example. I have no opinion either way on whether there are lots of kids with autism who aren’t getting diagnosed with autism (I’ve never looked into it), but if that’s true, I don’t see why it would give me any reason to be skeptical of Camarata’s claims here. Can you explain?
Engaging in pretend play has nothing to do with autism. Some autistic kids don’t, some do.
“Nothing to do” is a strong claim—IIUC, you’re claiming that there’s no correlation whatsoever between autism and pretend play. My quick literature / google search strongly suggests that this is not true. Do you have a reference or anything? Or how did you come to believe that? Do you have a sense of how common this position is among experts?
Also what you describe as the specific disconnect between the brain and mouth in speech is called speech apraxia and is the main reason why there are late-talking or non-speaking autistic people.
I’m not in a position to adjudicate this debate, but FWIW Camarata claims that apraxia is vanishingly rare as a cause of delayed speech. He describes showing a video of a child with (what he calls) real speech apraxia to an audience of speech pathologists, and he said ‘have any of you seen a kid like this?’, and everyone said ‘no, never’, and he said ‘OK, so why are you guys diagnosing apraxia all the time?’ (That’s a paraphrase; I don’t have the book on hand.) (I myself was diagnosed with speech apraxia when my parents brought me in for a diagnosis at age 2.something in the 1980s; my parents had kept the records from the hospital.)
I’d love to test this theory, please give feedback in the comments about your own work experience and thoughts on problem factorization.
Yes, I too have a rant along those lines from a post a while back; here it is:
I’m generally skeptical that anything in the vicinity of factored cognition will achieve both sufficient safety and sufficient capability simultaneously, for reasons similar to Eliezer’s here. For example, I’ll grant that a team of 10 people can design a better and more complex widget than any one of them could by themselves. But my experience (from having been on many such teams) is that the 10 people all need to be explaining things to each other constantly, such that they wind up with heavily-overlapping understandings of the task, because all abstractions are leaky. And you can’t just replace the 10 people with 100 people spending 10× less time, or the project will absolutely collapse, crushed under the weight of leaky abstractions and unwise-in-retrospect task-splittings and task-definitions, with no one understanding what they’re supposed to be doing well enough to actually do it. In fact, at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain.
It’s weird to think about what “respecting agency” means when the agent in question doesn’t currently exist and you are building it from scratch and you get to build it however you want. You can’t apply normal intuitions here.
For example, brainwashing a human such that they are only motivated to play tic-tac-toe is obviously not “respecting their agency”. We’re all on the same page for that situation.
But what if we build an AGI from scratch such that it is only motivated to play tic-tac-toe, and then we let the AGI do whatever it wants in the world (which happens to be playing tic-tac-toe)? Are we disrespecting its agency? If so, I don’t feel the force of the argument that this is bad. Who exactly are we harming here? Is it worse than not making this AGI in the first place?
Was evolution “disrespecting my agency” when I was born with a hunger drive and sex drive and status drive etc.? If not, why would it be any different to make an AGI with (only) a tic-tac-toe drive? Or if yes, well, we face the problem that we need to put some drives into our AGI or else there’s no “agent” at all, just a body that takes random actions, or doesn’t do anything at all.
I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
I mean, there’s a sense in which every aspect of developing AGI capabilities is “relevant for AGI safety”. If nothing else, for every last line of AGI source code, we can do an analysis of what happens if that line of code has a bug, or if a cosmic ray flips a bit, and how do we write good unit tests, etc.
So anyway, there’s a category of AGI safety work that we might call “Endgame Safety”, where we’re trying to do all the AGI safety work that we couldn’t or didn’t do ahead of time, in the very last moments before (or even after) people are actually playing around with the kind of powerful AGI algorithms that could get out of control and destroy the future.
My claims are:
If there’s any safety work that requires understanding the gory details of the brain’s learning algorithms, then that safety work is in the category of “Endgame Safety”—because as soon as we understand those gory details, we’re within spitting distance of a world in which hundreds of actors around the world are able to build very powerful and dangerous superhuman AGIs. My argument for that claim is §3.7–§3.8 here. (Plus here for the “hundreds of actors” part.)
The following is a really bad argument: “Endgame Safety is really important, so let’s try to make the endgame happen ASAP, so that we can get to work on Endgame Safety.” It’s a bad argument because, What’s the rush? There’s going to be an endgame sooner or later, and we can do Endgame Safety Research then! Bringing the endgame sooner is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed. I claim that there is plenty of safety work that we can do right now that is not in the category of “Endgame Safety”, and in particular that posts #12–#15 are in that category (and they have lots more open questions!).
I think another thing is: One could think of subagents as an inevitable consequence of bounded rationality.
In particular, you can’t have a preference for “a future state of the world”, because “a future state of the world” is too complicated to fit in your head. You can have a preference over “thoughts” (see §9.2.2 here), and a “thought” can involve attending to a particular aspect of “a future state of the world”. But if that’s how preferences are built, then you can immediately get into situations where you can have multiple preferences conflicting with each other—e.g. there’s a future state of the world, and if you think about it one way / attend to one aspect of it, it’s attractive, and if you think about it another way / attend to a different aspect of it, it’s aversive. And if you anthropomorphize those conflicting preferences (which is reasonable in the context of an algorithm that will back-chain from arbitrary preferences), you wind up talking about conflicting “subagents”.
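Here’s a toy illustration of that point (my own, and deliberately simplistic; the valences are made up): the same future state gets different valences depending on which aspect of it the “thought” attends to.

```python
# Toy illustration: preferences attach to "thoughts" (a state viewed under some
# attended aspect), not to full world-states, so one state can be attractive
# under one framing and aversive under another. Valences here are made up.

valence_of_aspect = {
    "finishing the project": +5.0,
    "pulling an all-nighter": -4.0,
}

# One future state, two ways of attending to it:
future_state_aspects = ["finishing the project", "pulling an all-nighter"]

for aspect in future_state_aspects:
    print(f"Attending to {aspect!r}: valence {valence_of_aspect[aspect]:+.1f}")

# Anthropomorphize the two conflicting evaluations and you get two "subagents".
```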
at that point they’re selected pretty heavily for also finding lots of stuff about alignment.
An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P
That said, I do definitely put some nonzero weight on those kinds of considerations. :)
I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.
Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.
So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P
Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
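As a toy example of “regularization stacking the deck” (my own numbers, not from the linked section): suppose policy B systematically earns slightly more reward than policy A, but a complexity penalty still makes A win.

```python
# Toy example: a regularizer can favor a policy that earns slightly *less*
# reward. Numbers and the complexity measure are made up for illustration.

policies = {
    "A_simple": {"avg_reward": 0.90, "complexity": 0.01},
    "B_complex": {"avg_reward": 0.92, "complexity": 0.10},
}

REG_STRENGTH = 1.0  # assumed weight on the regularization term

def regularized_objective(p: dict) -> float:
    return p["avg_reward"] - REG_STRENGTH * p["complexity"]

winner = max(policies, key=lambda name: regularized_objective(policies[name]))
print(winner)  # "A_simple", despite B's systematically higher reward
```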
It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.
Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)
But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)
Why? Because of instrumental convergence, i.e. the goal-preservation drive. After all, encountering those situations would lead to its no longer valuing task completion.
So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.
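As a cartoon of the above (my own, not the OP’s): if reward and task completion coincide in every episode actually encountered, the training data can’t distinguish “values task completion” from “values reward”, and deliberately incomplete exploration can keep it that way.

```python
# Cartoon: with reward and task completion perfectly correlated in the
# *encountered* data, both candidate motivations fit equally well.

encountered_episodes = [
    {"task_completed": True, "reward": 1.0},
    {"task_completed": True, "reward": 1.0},
    {"task_completed": False, "reward": 0.0},
]

def fits_encountered_data(predicted_value) -> bool:
    """Does this candidate motivation predict the rewards seen so far?"""
    return all(predicted_value(ep) == ep["reward"] for ep in encountered_episodes)

def values_task_completion(ep):
    return 1.0 if ep["task_completed"] else 0.0

def values_reward_itself(ep):
    return ep["reward"]

print(fits_encountered_data(values_task_completion))  # True
print(fits_encountered_data(values_reward_itself))    # True -- underdetermined

# The episode that would break the tie (reward *without* task completion, e.g.
# reward tampering) is exactly the one a foresighted, task-completion-valuing
# agent would choose never to explore.
```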
(By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.
So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of instrumental convergence, i.e. the goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.
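A minimal sketch of that “explore if and only if it wants to” decision (my own toy model; all names and numbers are made up): the agent scores a candidate exploratory action with its current values, using a world-model prediction of how that action would change those values.

```python
# Toy model of foresighted exploration: score each candidate action by how much
# of what the agent *currently* values its predicted future self would pursue.

current_values = {"task_completion": 1.0, "wireheading": 0.0}

def predicted_values_after(action: str) -> dict:
    """Hypothetical world-model prediction: addictive rewards warp the value function."""
    if action == "try_lots_of_cocaine":
        return {"task_completion": 0.0, "wireheading": 1.0}
    return dict(current_values)

def appeal(action: str) -> float:
    """Evaluate the predicted future self under *current* values (goal preservation)."""
    future = predicted_values_after(action)
    return sum(current_values[k] * future[k] for k in current_values)

options = ["try_lots_of_cocaine", "keep_doing_tasks"]
print(max(options, key=appeal))  # "keep_doing_tasks" -- it declines to explore
```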
Yeah, I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there, right?)
From the perspective of sculpting AGI motivations, I think it might be trickier to directly intervene on meta-preferences than to directly intervene on (object-level) preferences, because if the AGI is attending to something related to sensory input, you can kinda guess what it’s probably thinking about and you at least have a chance of issuing appropriate rewards by doing obvious straightforward things, whereas if the AGI is introspecting on its own current preferences, you need powerful interpretability techniques to even have a chance to issue appropriate rewards, I suspect. That’s not to say it’s impossible! We should keep thinking about it. It’s very much on my own mind, see e.g. my silly tweets from just last night.
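A rough sketch of why I suspect that asymmetry (mine, and very hypothetical; the string-matching stands in for whatever the overseer or interpretability tools can actually detect): for object-level preferences the overseer can key the reward off the sensory stream, whereas for meta-preferences there’s nothing in the sensory stream to key off of without interpretability.

```python
# Rough sketch: rewarding object-level preferences vs. meta-preferences.
# Everything here is hypothetical; "interpretability_readout" is doing all the
# hard work in the second branch.

def overseer_reward(sensory_input=None, interpretability_readout=None) -> float:
    if sensory_input is not None:
        # Object-level case: guess what the AGI is thinking about from what it
        # is looking at, and reward the obvious straightforward things.
        return 1.0 if "person_being_helped" in sensory_input else 0.0
    if interpretability_readout is not None:
        # Meta-preference case: only possible if powerful interpretability
        # tools can tell us what the AGI thinks about its own preferences.
        return 1.0 if "endorses_current_prosocial_values" in interpretability_readout else 0.0
    # Introspection with no interpretability readout: the overseer is flying blind.
    return 0.0

print(overseer_reward(sensory_input="person_being_helped, smiling"))  # 1.0
print(overseer_reward())  # 0.0 -- introspection, no readout, nothing to go on
```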