I did computational cognitive neuroscience from getting my PhD in 2006 until the end of 2022. I’ve worked on a bunch of brain systems, focusing on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
Me avoiding heroin isn’t “not governed by the critic,” instead what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
I think we’re largely in agreement on this. The actor system is controlling a lot of our behavior. But it’s doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.
However, I also want to claim that the critic system is directly in charge when we’re using model-based thinking: when we come up with a predicted outcome before acting, the critic supplies the estimate of how good that outcome is. But I’m not even sure this is a crux. The critic is still in charge in a pretty important way.
If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
I think that information would be found in both the actor and the critic, though not to exactly the same degree; the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn’t even bring into the article), and the critic. For instance, if it doesn’t occur to you to think about the likely consequences of doing heroin, the decision is based on the critic’s prediction that the heroin will be awesome. If the process, probably governed by the actor, does produce a prediction of withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic’s very negative assignment of value to that part of the outcome.
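To make that concrete, here’s a minimal sketch in Python of the two decision routes described above. The actions, values, and tiny “world model” are invented for illustration; they’re not from the article or a claim about how the brain implements this.

```python
# Minimal sketch (illustrative values only) of a model-free choice, which uses the
# critic's cached value of each action, versus a model-based choice, which sums the
# critic's values over the outcomes a world model predicts.

critic_value = {                 # critic's learned value estimates
    "take_heroin": 10.0,         # "the heroin will be awesome"
    "withdrawal_and_degradation": -50.0,
    "abstain": 0.0,
}

world_model = {                  # predicted consequences of each action
    "take_heroin": ["take_heroin", "withdrawal_and_degradation"],
    "abstain": ["abstain"],
}

def decide(options, consider_consequences):
    """Pick the option the critic scores highest."""
    def score(action):
        if consider_consequences:
            # model-based: rough sum of critic values over predicted outcomes
            return sum(critic_value[outcome] for outcome in world_model[action])
        # model-free: the critic's cached value of the action itself
        return critic_value[action]
    return max(options, key=score)

print(decide(["take_heroin", "abstain"], consider_consequences=False))  # take_heroin
print(decide(["take_heroin", "abstain"], consider_consequences=True))   # abstain
```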
The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is.
I totally agree. That’s why the key question here is whether the critic can be reprogrammed after there’s enough knowledge in the actor and the world model.
As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.
The critic is only representing adult human “values” as the result of tons of iterative learning between the systems. That’s the theory, anyway.
It’s also worth noting that, even if this isn’t how the human system works, it might be a workable scheme to make more alignable AGI systems.
This is a huge practical issue that seems to not get enough thought, and I’m glad you’re thinking about it. I agree with your summary of one way forward. I think there’s another PR front; many educated people outside of the relevant fields are becoming concerned.
It sounds like the ML researchers at that conference are mostly familiar with MIRI-style work. And they actually agree with Yudkowsky that it’s a dead end. There’s a newer tradition of safety work focused on deep networks. That’s what you mostly see on the Alignment Forum. And it’s what you see in the safety teams at DeepMind, OpenAI, and Anthropic. And those companies appear to be making more progress than all the academic ML researchers put together.
Agreed on the paragraph-size comment. My eyes and brain shy away. Paragraphs, I think, are supposed to contain roughly one idea, so a one-sentence paragraph is a nice change of pace if it’s an important idea. Your TLDR was great; I think those are better at the top, where they function as an abstract and tell the reader why they might want to read the whole piece and how to mentally organize it. ADHD is a reason your brain wants to write stream of consciousness, and attention to paragraph structure is a great check on communicating to others in a way that won’t overwhelm their slower brains :)
Other communities should be moving to AF-style publication, not the other way around. This is how science should be communicated; it has all the virtues of peer review without the massive downsides.
I just moved from neuroscience to publishing on LessWrong. The publishing structure here is far superior to a journal on the whole. Waiting for peer review instead of getting it in comments is an insane slowdown on the exchange of ideas.
Journal articles are discussed by experts in private. Blog posts are discussed in public in the comments. The difference in amount of analysis shared per amount of time is massive.
Issues like mathematical or other rigor are separate issues. Having tags and other sorting systems to distinguish long and rigorous work from quick writeups of simple ideas, points, and results would allow the best of both worlds.
Furthermore, we have known this for some time. In about 2003 exactly this type of publishing was suggested for neuroscience, for the above reasons—and as a way to give credit for review work. Neuroscience won’t switch to it because of cultural lock-in. Don’t give up your great good fortune in not being stuck in an antique system.
I’m not sure I’m following you. I definitely agree that human behavior is not completely determined by the critic system, and that this complicates the alignment of brain-like AGI. For instance, when we act out of habit, the critic isn’t even invoked until the action is completed, at the earliest, and maybe not at all.
But I think you’re addressing instinctive behavior. If you throw something at my eye, I’ll blink—and this might not take any learning. If an electrical transformer box blows up nearby, I might adopt a stereotyped defensive posture with one arm out and one leg up, even if I’ve never studied martial arts (this is a personal anecdote from a neuroscience instructor on instincts). If you put sugar in my mouth, I’ll probably salivate even as a newborn.
However, those are the best examples I can come up with. I think that evolution has worked by making its preferred outcomes (or rather simple markers of them) be rewarding. The critic system is thought to derive reward from more than the four Fs; curiosity and social approval are often theorized to innately produce reward (although I don’t know of any hard evidence that these are primary rewards rather than learned rewards, after looking a good bit).
Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route to trying to hardwire behaviors in an extremely complex world, where any hardwired behavior would also have to compete with the whole forebrain system for control.
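To spell out what “expectation-discounted reward” means in standard reinforcement learning terms, here is a minimal sketch of textbook TD(0) value learning in Python. The states, reward, and constants are invented, and this is not a claim about the brain’s exact algorithm; it just shows how a prediction-error signal lets a critic propagate value from an innate reward back to the cue that predicts it.

```python
# Minimal TD(0) sketch: the critic nudges each state's value toward
# reward + discounted value of what follows (the "expectation-discounted reward").

gamma = 0.9                     # discount factor
alpha = 0.1                     # learning rate
value = {"cue": 0.0, "food": 0.0, "end": 0.0}   # critic's value estimates

def td_update(state, reward, next_state):
    # prediction error: what happened (plus discounted expectation of the future)
    # minus what the critic expected from this state
    delta = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * delta
    return delta

# Repeated episodes: a neutral cue reliably precedes an innately rewarding outcome.
for _ in range(200):
    td_update("cue", 0.0, "food")    # no reward yet, but food follows
    td_update("food", 1.0, "end")    # innate reward arrives

print(value)   # value of "food" approaches 1.0; value of "cue" approaches 0.9
```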
But again, I might be misunderstanding you. In any case, thanks for the thoughts!
Edits, to address a couple more points:
I think the critic system, and your conscious predictions and preferences, are very much in charge in your decision not to find some heroin even though it’s reputedly the most rewarding thing you can do with a single chunk of time once you have some. You are factoring in your huge preference to not spend your life like the characters in Trainspotting, stealing and scrounging through filth for another fix. Or at least it seems that’s why I’m not doing it.
The low information budget of evolution is exactly why I think it relies on hardwired reward inputs to the critic for governing behavior in mammals that navigate and learn relatively complex behaviors in relatively complex perceived environments.
It seems you’re saying that a good deal of our behavior isn’t governed by the critic system. My estimate is that even though it’s all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.
I’ll look at your posts to see if you discuss this elsewhere. Or pointers would be appreciated.
This type of thinking seems important and somewhat neglected. Holden Karnofsky tossed out a point in Success Without Dignity: the AGI alignment community seems to heavily emphasize theoretical and technical thinking over practical (organizational, policy, and publicity) thinking. This seems right in retrospect, and we should probably be correcting that with more posts like this.
This seems like an important point, but fortunately pretty easy to correct. I’d summarize it as: “if you don’t have a well-thought-out plan for when and how to stop, you’re planning to continue into danger.”
I think this is a tricky tradeoff. There’s effectively a race between alignment and capabilities research. Better theories of how AGI is likely to be constructed will help both efforts. Which one it will help more is tough to guess.
The one thought I’d like to add is that the AI safety community may think more broadly and creatively about approaches to building AI. So I wouldn’t assume that all of this thinking has already been done.
I don’t have an answer on this, and I’ve thought about it a lot since I’ve been keeping some potential infohazard ideas under my hat for maybe the last ten years.
I think maybe this was mistitled. It seems to make a solid argument against certainty in AI timelines. It does not argue against attempting predictions, or against taking seriously the distribution across those attempts.
It points to the accuracy of some predictions of space flight, then notes that others were never implemented. It could well be that there are multiple viable ways to build a rocket, a steam engine, or an AGI.
Von Braun would weep at our lack of progress on space flight. But we did not progress because there don’t actually seem to be near-term economic incentives. There probably are for AGI.
Timelines are highly uncertain, but dismissing the possibility of short timelines makes as little sense as dismissing long timelines.
Human preferences as RL critic values—implications for alignment
Along with finding structured opportunities, you can practice this attitude in most conversations with colleagues, friends, family, and partners. People complain a lot. And when they do, you can practice active listening and empathy. I believe that making this a goal of every conversation has changed my habits over time, so that I do more useful things when it’s important.
This seems like useful analysis and categorization. Thank you.
These are not logically independent probabilities. In some cases, multiple can be combined. Your trial and error, value function, The Plan, etc. could mostly all be applied in conjunction and stack success, it seems (see the toy illustration below).
For others, and not coincidentally the more promising ones, like bureaucracy and tool AI, success does not prevent new AGI with different architectures that would need new alignment strategies. Unless the first aligned ASI is used to prevent other ASIs from being built.
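A toy numerical illustration of the stacking point, with made-up probabilities of my own rather than anything from the post:

```python
# Toy numbers only: if several alignment approaches can be applied in conjunction,
# the chance that at least one works is higher than any single approach's chance,
# but only to the extent that their failures are independent.

p_success = [0.3, 0.4, 0.2]     # hypothetical per-approach success probabilities

p_all_fail = 1.0
for p in p_success:
    p_all_fail *= (1.0 - p)     # assumes independent failures

p_any = 1.0 - p_all_fail
print(f"at least one succeeds (independent failures): {p_any:.2f}")   # ~0.66

# If the approaches share a prerequisite (so they are not logically independent),
# the combined probability is capped by that shared factor.
p_shared_prerequisite = 0.5     # e.g., all rely on the same assumption holding
print(f"with a shared prerequisite: {p_shared_prerequisite * p_any:.2f}")   # ~0.33
```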
https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model
This is highly similar, I think.
In your question, you say that the Siebe post arrived at a 15% chance of long covid. That post in fact stated a 1% to 15% estimate, which is very different. Please edit your quote.
Acknowledging that humans have a variety of values doesn’t invalidate the alignment problem; it makes it harder. Sophisticated discussions of the alignment problem address this additional difficulty.
I think there’s a key error in the logic you present. The idea that a self-improving AGI will very quickly become vastly superior to humanity is based on the original assumption that AGI will consist of a relatively compact algorithm that is mostly software-limited. The newer assumption is a vastly slower takeoff, perhaps years long but almost certainly much longer than seconds, as hardware-limited neural network AGI finds larger servers, or designs and somehow builds more efficient hardware. This scenario puts an AGI in vastly more danger from humanity than your fast-takeoff scenario.
Edit: this is not to argue that the correct estimate is as high as 99.999; I’m just making this contribution without doing all the logic and math on my best estimate.
Confirmation bias is an enormous factor in conspiracy theories. If you want accurate beliefs, including accurate uncertainties about your beliefs, you want to do some serious research on confirmation bias and motivated reasoning. After studying cognitive biases for my job for four years, I believe those two carry the bulk of the practical effect.
Between the lines, I’m guessing your family is accusing you of sounding like a conspiracy theorist because you’re overconfident of your certainty in your new beliefs, based on confirmation bias in your research. I approximately share each of the beliefs you mention, after a modest amount of research on each. But I’m much less certain than you seem to be about exactly where the truth on each subject lies.
My impression after a little research is this: simple carbs that are not encased in fiber tend to cause more hunger sooner than other sources of calories. Therefore, even though all calories are roughly equal in directly causing weight gain, carbs indirectly cause more weight gain. And more discomfort, in fighting hunger.
Few people have done this sort of thinking here, because this community mostly worries about risks from general, agentic AI and not narrow AI. We worry about systems that will decide to eliminate us, and have the superior intelligence to do so. Survivor scenarios are much more likely from risks of narrow AI accidentally causing major disaster. And that’s mostly not what this community worries about.
A couple of thoughts:
It seems to me simpler to think in terms of preference fulfillment than boundaries. I strongly don’t want you in my house or messing with my body without consent, way more than I want you to wear fashions I enjoy seeing. And cells might be said to desire to keep their membranes intact.
Second, I think the idea of acausal trades can be boiled down to causal reasoning. Aren’t we just reasoning about the possibility of encountering a civilization with certain sets of values? This is just like causal reasoning about morality; one reason to be a cooperative person is so that when I encounter other cooperative people, they’ll want to cooperate with me.
Fascinating. I find the core logic totally compelling. LLMs must be narratologists, and narratives include villains and false fronts. The logic on RLHF actually making things worse seems incomplete, but I’m not going to discount the possibility. And I am raising my probabilities on the future being interesting, in a terrible way.
Sorry for the obscure reference. The Alignment Forum is the professional variant of LessWrong. It has membership by invitation only, which means you can trust the votes and comments to be better informed, and to come from real people rather than fake accounts.