A consolidated list of bad or incomplete solutions could have considerable didactic value: it could help people learn more about the various challenges involved.
Ben Smith
It seems like even on a “fast takeoff” view, we would probably have a few months between the point when we’ve built a superintelligence that appears to have unaligned values and the point when it is too late to stop it.
At that point, isn’t stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?
That almost implies a simple plan for preparation: for every AGI built, researchers agree to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users’ ability to come to it and request that the first one be shut down.
You haven’t factored in the possibility that Putin gets deposed by forces inside Russia who are worried about a nuclear war. Conditional on the use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).
I have a few technical quibbles here.
- It’s not quite accurate to imply that Philosophical Transactions of the Royal Society B has itself made a claim about the robustness of MCB. Generally, only an editorial endorsed by the editorial board should be taken as a statement of the journal’s own position; academic journals mostly provide a forum for academic debate, and only the authors of an article are really standing behind its position. The journal publishes work of a certain standard, so publishing a paper is somewhat an endorsement of the quality of the work, but not of the finding itself.
- As I understand it, it’s not a given that MCB will work. Only the sulphur dioxide approach is really proven, and “sulphur” makes me nervous (I don’t know if there are good grounds for that). More research is needed on MCB, which is a point in favour of funding and carrying out that research as soon as possible.
- It may be that researchers could do research into MCB without it being seen as an endorsement, by the entire field, of the idea that we don’t need other solutions. MCB can and should be presented as important experimental work that needs to be done, as a last resort. Once it is actually proven, that’s when we face the dilemma about what to tell the public. But at that point, looking at the status quo, it may be our only option left.
-
Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see—it’s impressive how many different topics you have synthesized.
I have one question on this post—maybe doesn’t rise above the level of ‘nitpick’, I’m not sure. You mention a “curiosity drive” and other Category A things that the “Steering Subsystem needs to do in order to get general intelligence”. You’ve also identified the human Steering Subsystem as the hypothalamus and brain stem.
Is it possible that things like a “curiosity drive” arise from, say, the way the telencephalon is organized, rather than from the Steering Subsystem itself? To put it another way, if the curiosity drive is mainly implemented as motivation to reduce prediction error, or to fill out the neocortex, how confident are you in identifying this process with the hypothalamus+brain stem?
The way I imagine buying the argument is something like: “the Steering Subsystem ultimately provides all rewards, and that would include reward from prediction error”. But then I wonder whether or not you’re implying some greater role for the hypothalamus+brain stem.
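To make the question concrete, here’s a minimal sketch of the kind of mechanism I have in mind (every name and number here is my own invention, not from your post):

```python
import numpy as np

# Hypothetical sketch: curiosity as a reward term computed from the
# learning subsystem's prediction error. My question is whether this
# computation lives in the hypothalamus+brain stem or falls out of how
# the telencephalon is organized.
def curiosity_reward(predicted: np.ndarray, observed: np.ndarray,
                     scale: float = 0.1) -> float:
    """Reward proportional to mean squared prediction error ("surprise")."""
    return scale * float(np.mean((predicted - observed) ** 2))
```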
I know it’s a touchy topic. In my defense, the research is solid, published in social psychology’s top journal. I suppose the study deals with rhetoric in a political context. This community has a long history of drawing on social and cognitive psychological research to understand fallacies of thought and rhetoric (HPMOR), and I posted in that tradition. Apologies if I have strayed a little too far into a politicized area.
One needn’t see this study as a shot at any particular political side—I can imagine people engaging ‘virtuous-victimhood-signalling’ within a wide range of different politicized narratives, as well as in completely apolitical contexts.
It also shouldn’t be read as delegitimizing victims who speak out about their perspective! But perhaps it does provide evidence that sympathy can be weaponized in rhetorical conflict. We can all recognize this in political opponents while being blind to it amongst political allies.
I found the Clark et al. (2019) “Bayesing Qualia” article very useful, and that did give me an intuition of the account that perhaps sentience arises out of self-awareness. But they themselves acknowledged in their conclusion that the paper didn’t quite demonstrate that principle, and I didn’t find myself convinced of it.
Perhaps what I’d like readers to take away is that sentience and self-awareness can at the very least be conceptually distinguished. Even if it isn’t clear empirically whether they are intrinsically linked, we ought to maintain a conceptual distinction in order to form testable hypotheses about whether they are in fact linked, and to reason about the nature of any link. Perhaps I should call that “Theoretical orthogonality”. This matters for reasoning about whether, for instance, giving our AIs self-awareness or situational awareness will cause them to be sentient. I don’t think it will, although I do think that if you gave them the sort of detailed self-monitoring feelings that humans have, that might yield sentience. But it’s not clear!
I listened to the whole episode with Bach as a result of your recommendation! Bach hardly even got a chance to express his ideas, and I’m not much closer to understanding his account of
meta-awareness (i.e., awareness of awareness) within the model of oneself which acts as a ‘first-person character’ in the movie/dream/“controlled hallucination” that the human brain constantly generates for itself is the key thing that also compels the brain to attach qualia (experiences) to the model. In other words, the “character within the movie” thinks that it feels something because it has meta-awareness (i.e., the character is aware that it is aware), which reflects the actual meta-cognition happening in the brain, rather than in the character, insofar as the character is a faithful model of reality.
which seems like a crux here.
He sort of briefly described “consciousness as a dream state” at the very end, and although I did get the sense that maybe he thinks meta-awareness and sentience are connected, I didn’t really hear a great argument for that point of view.
He spent several minutes arguing that agency, or pursuing a utility function, is something humans have, but that these things aren’t sufficient for consciousness (I don’t remember whether he said they were necessary, so I suppose we don’t know if he thinks they’re orthogonal).
That was an inspiring and enjoyable read!
Can you say why you think AUP is “pointless” for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
I think your point is interesting and I agree with it, but I don’t think Nature are only addressing the general public. To me, it seems like they’re addressing researchers and policymakers and telling them what they ought to focus on as well.
Owning a house doesn’t give you fewer ongoing costs. It tends to give you lower costs overall, but that’s heavily contingent on rental and mortgage rates. And it’s actually more administrative hassle, because you have to spend money on rates (local property taxes), repairs, and so on. The main thing owning a house gives you is stability, in terms of predicting future price changes.
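To illustrate how contingent the comparison is on rates (all numbers below are made up purely for illustration):

```python
# Hypothetical numbers, purely illustrative.
rent = 24_000            # annual rent on a comparable house
loan = 400_000           # outstanding mortgage
mortgage_rate = 0.05     # annual interest rate
property_taxes = 4_000   # "rates"
repairs = 5_000          # annual upkeep

owning_cost = mortgage_rate * loan + property_taxes + repairs
print(owning_cost)  # 29000.0: owning costs more than renting here,
                    # but at a 3% mortgage rate it would be 21000.0,
                    # cheaper than renting. The comparison flips on rates.
```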
For me the issue is that:
1. it isn’t clear how you could enforce attendance, and
2. it isn’t clear what value individual attendees would get that makes it worth their while to attend regularly.
(2) is sort of a collective action/game theoretic/coordination problem.
(1) reflects the rationalist nature of the organization.
Traditional religions back up attendance with divine command. They teach absolutist, divine-command-theoretic accounts of morality, backed by accounts of commands from God to attend regularly. At its most severe, this is backed by the threat of eternal hellfire for disobedience. But it doesn’t usually come to that: the moralization of the attendance norm is strong enough to justify moderate amounts of social pressure to conform to it, and often that’s enough.
In a rationalist congregation, if you want a regular attendance norm, you have to ground it in a rational understanding that adhering to the norm makes the organization work. I think that might work, but it’s probably a lot harder: it requires many more cognitive steps to get to, and it only works so long as attendees buy into the goal of contributing to the project for its own sake.
-
At the moment I have two theories about how shards seem able to form consistent, competing values that don’t always optimize for some ultimate goal:
First, shard theory was developed to describe the behavior of human agents, whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different to modern deep RL agents: although they can also have lots of input and output nodes, those are finely honed to achieve a fairly narrow goal, so in a sense it is not too surprising that they sometimes Goodhart on the goals they are given. In contrast, there’s no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain’s subfornical organ indicates there isn’t enough sodium in the bloodstream (Oka, Ye, & Zuker, 2015); water consumption also gets reward points when there isn’t enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers, all at the same time.
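Here’s a toy sketch of what I mean by parallel primary reinforcers (purely illustrative; the state variables and gating are made up):

```python
# Purely illustrative: several primary reinforcers contribute reward in
# parallel, each gated by a physiological state, with no single terminal
# objective behind them.
def primary_reward(state: dict, consumed: dict) -> float:
    reward = consumed.get("sugar", 0.0)         # sweet food scores by default
    if state.get("sodium_deficit", 0.0) > 0.0:  # salt rewards only when sodium is low
        reward += consumed.get("salt", 0.0)
    if state.get("water_deficit", 0.0) > 0.0:   # water rewards only when thirsty
        reward += consumed.get("water", 0.0)
    return reward
```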
Second, as far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced via backpropagation. The brain doesn’t use backpropagation (though it may have similar or analogous processes); it seems to “reward” successful connections throughout the neocortex (successful in terms of prediction-error reduction, or temporal/spatial association, or simply firing at the same time...?), without those updates necessarily having to propagate backwards from some primary reinforcer.
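To illustrate the contrast (a toy sketch, not a claim about the actual cortical learning rule): a local update needs only pre- and post-synaptic activity, perhaps scaled by a broadcast reward signal, whereas backpropagation needs an error signal passed backwards through the whole network.

```python
import numpy as np

# Toy Hebbian-style update: a connection is strengthened by its own
# pre/post activity, optionally modulated by a globally broadcast reward.
# Nothing propagates backwards from a primary reinforcer.
def local_update(w: np.ndarray, pre: np.ndarray, post: np.ndarray,
                 reward: float = 1.0, lr: float = 0.01) -> np.ndarray:
    return w + lr * reward * np.outer(post, pre)
```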
The point about getting better at credit assignment as you get older is probably not too much of a concern. It’s very high level, and to the extent it is true, it’s mostly attributable to a more sophisticated world model. If you put a 40-year-old and an 18-year-old into a credit assignment game in a novel computer game environment, I doubt the 40-year-old will do better. They might beat a 10-year-old, but only to the extent that the 40-year-old has learned very abstract facts about associations between objects which they can apply to the game. Speed it up so that they can’t use System 2 processing, and the 10-year-old will probably beat them.
Great post! Two points of disagreement seem worth mentioning.
Exploring the full ability of dogs and cats to communicate isn’t so much impractical to do in academia; it just isn’t very theoretically interesting. We know animals can do operant conditioning (we’ve probably known for over 100 years), but we also know they struggle with complex syntax. I guess there’s a lot of uncertainty in the middle, so I’m low-confidence about this. But generally, to publish a high-impact paper about dog or cat communication, you’d have to show they can do more than “conditioning”: that they understand syntax in some way. That’s probably pretty hard; maybe you can do it, but do you want to stake your career on it?
That brings me to my second point: is it more than operant conditioning? Some of the videos show the animals pressing multiple buttons, but Billy the Cat’s videos show his trainer teaching him his button sequences. I’m not a language expert, but to demonstrate syntax understanding, you have to do more than show he can learn sequences of button presses he was taught verbatim. At a minimum, there’d need to be evidence he can form novel sentences by combining buttons in apparently intentional ways that could only be put together by generalizing from syntax rules. Maaaybe @Adele Lopez’s observation that Bunny seems to reverse her owner’s word order might be appropriate evidence. But if she’s been reinforced for her own arbitrarily chosen word order in the past, she might develop it without really appreciating rules of syntax per se. In fact, a hallmark of learning a language is learning its syntax correctly.
interpretability on pretrained model representations suggest they’re already internally “ensembling” many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstraction
That seems encouraging to me. There’s a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving it. It does this by having a coherent “world model”, and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way in pursuing its goals.
In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more human-like. It seems more human-like specifically in that the system won’t be locked onto a particular goal, or even a particular perspective on how to achieve it, but could arrive at a particular solution ~~randomly~~, based on quirks of training data.
I wonder if there’s something analogous to human personality, where being open to experience, or even open to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance), is useful for seeing the world in different ways, trying out strategies, and changing tack until success can be found. If this process applies to selecting goals, or at least sub-goals (which it certainly does in humans), you get a system that is perhaps capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, hopefully amongst the goals we give it.
Thank you for writing this. I really appreciate it personally!
Then the next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict; if its confidence in that prediction exceeds a suitable threshold, it goes ahead and acts without human intervention. But any prediction of what a human would decide could be second-guessed by a human pointing out where the prediction is wrong.
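A minimal sketch of the loop I’m imagining (every name here is invented for illustration):

```python
from typing import Callable, Tuple

# Invented names throughout; this is just the shape of the proposal.
def resolve_conflict(
    conflict: str,
    predict: Callable[[str], Tuple[str, float]],  # returns (resolution, confidence)
    ask_human: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    resolution, confidence = predict(conflict)
    if confidence >= threshold:
        return resolution       # confident enough: act without human intervention
    return ask_human(conflict)  # otherwise defer; the answer also becomes training data
```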
Agreed that a human understanding the plan (and all the relevant outcomes; which outcomes are relevant?) is important, and harder than I first imagined.
American Academy of Pediatrics lies to us once again....
“If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers.”
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.

Is it, though? I’m no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.
https://www.researchgate.net/publication/220009177_Language_Development_in_Blind_Children:
Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision.
Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different.
As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children
Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/or to their limited access to information about external reality. However these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language.

I don’t know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission, by not telling us about things other than language that are affected by children’s sight.
That’s a bit different to the dishonesty alleged, though.
Not sure what I was thinking about, but probably just that my understanding is that “safe AGI via AUP” would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.
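For reference, the flavour of penalty I have in mind is my loose rendering of the AUP idea (the names are mine, not from any actual AUP codebase): the agent is docked reward for changing its ability to achieve auxiliary goals, relative to doing nothing.

```python
# Loose rendering of an AUP-style penalty, with invented names.
def aup_reward(r, q_aux, state, action, noop, lam=0.1):
    """r: task reward; q_aux: list of Q-functions for auxiliary goals."""
    penalty = sum(abs(q(state, action) - q(state, noop)) for q in q_aux)
    return r - lam * penalty / len(q_aux)
```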
Your “social dynamics” section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I’ll post soon, so thank you for that!
That’s right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.
I guess whether one calls this “multi-objective RL” is a semantic question. Because objectives are combined during action selection, not during learning itself, I would not call it “single-objective RL with a complicated objective”. If objectives were combined in the reward signal, then I could call it that.
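To illustrate the distinction, here’s a toy formulation (my own sketch, not the exact implementation from our technical work): each objective keeps its own learned Q-values, and scalarization enters only when an action is chosen.

```python
import numpy as np

# Toy sketch: Q has shape (n_actions, n_objectives); each column is
# learned as its own objective, and the (possibly non-linear)
# scalarization is applied only at action selection, never to the reward.
def select_action(Q: np.ndarray, scalarize=np.min) -> int:
    scores = [scalarize(Q[a]) for a in range(Q.shape[0])]
    return int(np.argmax(scores))
```

With scalarize=np.min you get a conservative “worst objective first” policy; a linear weighted sum would take you back toward something like single-objective RL with a complicated objective.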
Re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven’t thought about homeostatic processes in this project (though my upcoming paper is all about them!). I’m definitely not suggesting that our particular implementation of “MORL” (if we can call it that) is the only or even the best sort of MORL; I’m just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a sort of multi-objective decision-making system with no single combinatory mechanism at all, except for the emergent winner of whatever kind of output happens in a particular context; that could plausibly be different depending on whether an action is moving limbs, talking, or mentally setting an intention for a long-term plan.
There’s not just acceptance at stake here. Medical insurance companies are not typically going to buy into a responsibility to support clients’ morphological freedom, as if medically transitioning were in the same class of thing as a cis person getting a facelift or a cis woman getting a boob job, because it is near-universally understood that those are “elective” medical procedures. But if their clients have a “condition” that requires “treatment”, well, now insurers are on the hook to pay.

A lot of mental health treatment works the same way, imho: people have various psychological states, many of which get inappropriately shoehorned into a pathology or illness narrative in order to get the insurance companies to pay.
All this adds a political dimension to the not inconsiderable politics of social acceptance.