I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making. I've focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
Seriously, what? I’m missing something critical. Under the stated rules as I understand them, I don’t see why anyone would punish another player for reducing their dial.
You state that 99 is a Nash equilibrium, but this just makes no sense to me. Is the key that you're stipulating that everyone must play as though everyone else is out to make it as bad as possible for them? That sounds like an incredibly irrational strategy.
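To make the distinction I'm drawing concrete, here's a toy sketch in Python (my own illustration with made-up payoffs, not the dial game from the post): the Nash condition only asks whether any single player gains by deviating alone, while "assume everyone is out to get you" is a maximin rule. Even in a toy game, a bad outcome can be a Nash equilibrium without anyone being paranoid:

```python
# Toy symmetric 2-player game (hypothetical payoffs, purely for illustration).
# Contrast the Nash-equilibrium check with a maximin ("paranoid") choice.
import itertools

# payoffs[my_action][their_action] -> my payoff
payoffs = {
    "low":  {"low": 3, "high": 0},
    "high": {"low": 2, "high": 1},
}
actions = list(payoffs)

def is_nash(profile):
    """(a1, a2) is a Nash equilibrium iff neither player gains by deviating alone."""
    a1, a2 = profile
    best1 = max(payoffs[a][a2] for a in actions)  # player 1's best reply to a2
    best2 = max(payoffs[a][a1] for a in actions)  # player 2's best reply to a1
    return payoffs[a1][a2] == best1 and payoffs[a2][a1] == best2

def maximin_action():
    """Pick the action with the best worst-case payoff (paranoid play)."""
    return max(actions, key=lambda a: min(payoffs[a].values()))

for p in itertools.product(actions, repeat=2):
    print(p, "Nash" if is_nash(p) else "not Nash")
print("maximin choice:", maximin_action())
```

In this toy game both (low, low) and (high, high) come out as Nash equilibria, even though (high, high) is worse for everyone; no paranoia is required for the check, which is why I'm confused about what work the "out to get you" assumption is doing.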
Other communities should be moving to AF style publication, not the other way around. This is how science should be communicated; it has all the virtues of peer review without the massive downsides.
I just moved from neuroscience to publishing on LessWrong. The publishing structure here is far superior to a journal on the whole. Waiting for peer review instead of getting it in comments is an insane slowdown on the exchange of ideas.
Journal articles are discussed by experts in private. Blog posts are discussed in public in the comments. The difference in amount of analysis shared per amount of time is massive.
Mathematical and other forms of rigor are a separate issue. Having tags and other sorting systems to distinguish long, rigorous work from quick writeups of simple ideas, points, and results would allow the best of both worlds.
Furthermore, we have known this for some time. In about 2003 exactly this type of publishing was suggested for neuroscience, for the above reasons—and as a way to give credit for review work. Neuroscience won’t switch to it because of cultural lock-in. Don’t give up your great good fortune in not being stuck in an antique system.
I thought he just meant “criticism is good, actually; I like having it done to me so I’m going to do it to you”, and was saying that rationalists tend to feel this way.
Welp, that looks like one central crux right there:
No need to worry about creating “zombie” forms of higher intelligence, as these will be at a thermodynamic/evolutionary disadvantage compared to conscious/higher-level forms of intelligence
I think the most important thing to note is that this hasn’t been part of enough discussions to make it into Zvi’s summary. What is happening sounds like the worst sort of polarization. Ad hominem attacks create so much mutual irritation that the discussion bogs down entirely. This effect is surprisingly powerful. I think polarization is the mind-killer.
On the object level, I think that question is interesting, and not clear-cut.
You're assuming the two alternatives are that everything she's said is true and accurate, or else nothing is. It does not require psychosis to make wrong interpretations or to have mild paranoia. It merely requires not being a dedicated rationalist, and/or having a hard life. I'm pretty sure that being abused would contribute to paranoia, which could lead her to get some things wrong.
Unfortunately, it’s going to be impossible to disentangle this without more specific evidence. Psychology is complicated. Both real recovered memories and fabricated memories seem to be common.
You didn't bother estimating the base rate of sexual abuse by siblings. While that's very hard to figure out (it's even harder to study or estimate than psychosis rates), it's very likely in the same neighborhood as your 1-3% figure for psychosis. So this isn't going to help much in resolving the issue.
Upvoted for making well-argued and clear points.
I think what you’ve accomplished here is eating away at the edges of the AGI x-risk argument. I think you argue successfully for longer timelines and a lower P(doom). Those timelines and estimates are shared by many of us who are still very worried about AGI x-risk.
Your arguments don’t seem to address the core of the AGI x-risk argument.
You've argued against many particular doom scenarios, but you have not presented a scenario that includes our long-term survival. Sure, if alignment turns out to be easy we'll survive; but I only see strong arguments that it's not impossible. I agree, and I think we have a chance; but it's just a chance, not success by default.
I like this statement of the AGI x-risk argument. It's my attempt to put the standard arguments about instrumental convergence and capabilities into common language:
Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than what you want, you're not going to get your way. If it doesn't care about you even a little, and it continues to become more capable faster than you do, you'll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous or simply outcompeted doesn't matter. It could take a long time, but if you miss the window of having control over the situation, you'll still wind up dead.
This could of course be expanded on ad infinitum, but that’s the core argument, and nothing you’ve said (on my quick read, sorry if I’ve missed it) addresses any of those points.
There were (I've been told) nine other hominid species. They are all dead. The baseline outcome of creating something smarter than you is that you are outcompeted and ultimately die out. The baseline assumption of survival seems based on optimism, not reason.
So I agree that P(doom) is less than 99%, but I think the risk is still very much high enough to warrant far more resources and caution than we're devoting now.
Some more specific points:
Fanatical maximization isn't necessary for doom. An agent with any goal still invokes instrumental convergence. It can be as slow, lazy, and incompetent as you like. The only question is whether it can outcompete you in the long run.
Humans are somewhat safe (though think about the nuclear standoff; I don't think we're even self-aligned in the medium term). But there are two reasons for that. First, humans can't self-improve very well, while AGI has many more routes to recursive self-improvement. On the roughly level human playing field, cooperation is the rational policy; in a scenario where you can focus on self-improvement, cooperation doesn't make sense long-term. Second, humans have a great deal of evolution behind them making our instincts guide us toward cooperation. AGI will not have that unless we build it in, and we have only very vague ideas of how to do that.
Loose initial alignment is way easier than long-term stable alignment. Existing alignment work barely addresses long-term stability.
A balance of power in favor of aligned AGI is tricky. Defending against misaligned AGI is really difficult.
Thanks so much for engaging seriously with the ideas, and putting time and care into communicating clearly!
I think the structure of Alignment Forum vs. academic journals solves a surprising number of the problems you mention. It creates a different structure for both publication and prestige. More on this at the end.
It was kind of cathartic to read this. I’ve spent some time thinking about the inefficiencies of academia, but hadn’t put together a theory this crisp. My 23 years in academic cognitive psychology and cognitive neuroscience would have been insanely frustrating if I hadn’t been working on lab funding. I resolved going in that I wasn’t going to play the publish-or-perish game and jump through a bunch of strange hoops to do what would be publicly regarded as “good work”.
I think this is a good high-level theory of what’s wrong with academia. I think one problem is that academic fields don’t have a mandate to produce useful progress, just progress. It’s a matter of inmates running the asylum. This all makes some sense, since the routes to making useful progress aren’t obvious, and non-experts shouldn’t be directly in charge of the directions of scientific progress; but there’s clearly something missing when no one along the line has more than a passing motivation to select problems for impact.
Around 2006 I heard Tal Yarkoni, a brilliant young scientist, give a talk on the structural problems of science and its publication model. (He's now an ex-scientist, as many brilliant young scientists become these days.) The changes he advocated were almost precisely the publication and prestige model of the Alignment Forum. It allows publications of any length and format, and provides a public time stamp for when ideas were contributed and developed. It also provides a public record, in the form of karma scores, of how valuable the scientific community found that publication. This only works in a closed community of experts, which is why I'm mentioning AF and not LW. One's karma score is publicly visible as a sum-total-of-community-appreciation of that person's work.
This public record of appreciation breaks an important deadlocking incentive structure in the traditional scientific publication model: if you're going to find fault with a prominent theory, your publication had better be damned good (or rather "good" by the vague aesthetic judgments you discuss). Otherwise you've just earned a negative valence from everyone who likes that theory and/or the people who have advocated it, with little to show for it. I think that's why there's little market for the type of analysis you mention, in which someone goes through the literature in painstaking detail to resolve a controversy, and then finds no publication outlet for their hard work.
This is all downstream of the current scientific model, which is roughly an advocacy model. As in law, it's considered good and proper to vigorously advocate for a theory even if you don't personally think it's likely to be true. This might make sense in law, but in academia it's the reason we sometimes say that science advances one funeral at a time. The effect of motivated reasoning combined with the advocacy norm causes scientists to advocate their favorite wrong theory unto their deathbed, and to be lauded by most of their peers for doing so.
The rationalist stance of asking that people demonstrate their worth by changing their mind in the face of new evidence is present in science, but it seemed to me much less common than the advocacy norm. This rationalist norm provides partial resistance to the effects of motivated reasoning. That is worth its own post, but I'm not sure I'll get around to writing it before the singularity.
These are all reasons that the best science is often done outside of academia.
Anyway, nice thought-provoking article.
This is much better than any of his other speaking appearances. The short format, and TED’s excellent talk editing/coaching, have really helped.
This is still terrible.
I thought it was a TEDx talk, and I thought it was perhaps the worst TEDx talk I’ve seen. (I agree that it’s rare to see a TEDx talk with good content, but the deliveries are usually vastly better than this).
I love Eliezer Yudkowsky. He is the reason I’m in this field, and I think he’s one of the smartest human beings alive. He is also one of the best-intentioned people I know. This is not a critique of Yudkowsky as an individual.
He is not a good public speaker.
I’m afraid having him as the public face of the movement is going to be devastating. The reactions I see to his public statements indicate that he is creating polarization. His approach makes people want to find reasons to disagree with him. And individuals motivated to do that will follow their confirmation bias to focus on counterarguments.
I realize that he had only a few days to prepare this. That is not the problem. The problem is a lack of public communication skills. Those are very different than communicating with your in-group.
Yudkowsky should either level up his skills, rapidly, or step aside.
There are many others with more talent and skills for this type of communication.
Eliezer is rapidly creating polarization around this issue, and that is very difficult to undo. We don’t have time to do that.
Could we bull through with this approach, and rely on the strength of the arguments to win over public opinion? That might work. But doing that instead of actually thinking about strategy and developing skills would hurt our odds of survival, perhaps rather badly.
I’ve been afraid to say this in this community. I think it needs to be said.
Regulation and the complexity of effects seem like two more big blockers.
Effects of genes are complex. Knowing a gene is involved in intelligence doesn’t tell us what it does and what other effects it has.
I wouldn’t accept any edits to my genome without the consequences being very well understood (or in a last-ditch effort to save my life). I’d predict severe mental illness would happen alongside substantial intelligence gains.
Source: research career as a computational cognitive neuroscientist.
I put this as a post- ASI technology, but that’s also a product of my relatively short timelines.
Downvoted for being absurdly overconfident, and thereby harming the whole case for more optimism on alignment. I'd downvote Eliezer for the same reason for his public 99.99% doom arguments; they are visibly silly, making the whole direction seem silly by association.
In both cases, there are too many unknown unknowns to have confidences remotely that high. And you've added way more silly zeros than EY, despite having looser arguments.
This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it’s just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.
So, I look forward to your next iteration, improved by the feedback on this post!
I'm not sure what the takeaway is here, but these calculations are highly suspect. What a memory athlete can memorize (in their domain of expertise) in 5 minutes is an intricate mix of working memory, long-term semantic memory, and episodic (hippocampal) memory.
This is a very deep topic. Reading comprehension researchers have estimated the size of working memory as “unlimited”, but that’s obviously specific to their methods of measurement.
Modern debates put working memory capacity at 1-4 items. The figure of 7 was specific to what is now known as the phonological loop, which is subvocally reciting numbers. The strong learned connections between auditory cortex and verbal motor areas give this a slight advantage over working memory for material that hasn't been specifically practiced a lot.
See the concept of exformation, incidentally from one of the best books I've found on consciousness. The bits of information encoded by a signal to a sophisticated system are intricately intermixed with that system's prior learning. It's a type of compression. Not making a call at a specific time can encode a specific signal of unlimited length, if sender and receiver agree to that meaning.
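As a minimal sketch (my own toy illustration, not from the book): the observed signal carries only one bit, but a pre-agreed codebook lets it decode to a message of arbitrary length, because the real content lives in the receiver's prior learning.

```python
# Toy illustration of exformation: a one-bit observation plus a shared codebook
# agreed on in advance (the receiver's "prior learning") yields a long message.
codebook = {
    "call_at_5pm":    "All is well; proceed with the original plan as discussed.",
    "no_call_at_5pm": "The deal fell through; sell everything and meet me at the harbor.",
}

def receive(observed_call: bool) -> str:
    """Decode a single present/absent observation into the full pre-agreed message."""
    return codebook["call_at_5pm" if observed_call else "no_call_at_5pm"]

print(receive(False))  # the *absence* of a call decodes to a long, specific message
```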
Sorry for the lack of citations. I’ve had my head pretty deeply into this stuff in the past, but I never saw the importance of getting a precise working memory capacity estimate. The brain mechanisms are somewhat more interesting to me, but for different reasons than estimating capacity (they’re linked to goals and reward system operation, since working memory for goals and strategy is probably how we direct behavior in the short term).
His statements seem better read as evasions than arguments. It seems pretty clear that LeCun is functioning as a troll in this exchange and elsewhere. He does not have a thought-out position on AI risk. I don't find it contradictory that someone could be really good at thinking about some stuff, and simply prefer not to think about some other stuff.
Great analysis. I’m impressed by how thoroughly you’ve thought this through in the last week or so. I hadn’t gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we’ll probably both be wrong in important ways, but I think it’s important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it’s important for the alignment implications.
LLM-bots are inherently easy to align, at least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT-4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, stop if it hits $X, and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a big bot disaster more seriously.
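As a minimal sketch (a hypothetical agent loop, not any particular framework), the kind of safeguard I mean can be as simple as a budget check that halts the agent and hands control back to the user before a cap is exceeded:

```python
# Hypothetical spending safeguard for an LLM agent loop: track costs, stop at a
# budget cap, and ask the user before exceeding it. Names and numbers are made up.
from dataclasses import dataclass

@dataclass
class BudgetGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def approve(self, cost_usd: float) -> bool:
        """Approve an expense only if it keeps total spending under the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            return False
        self.spent_usd += cost_usd
        return True

def run_agent(proposed_actions, guard: BudgetGuard):
    for description, cost in proposed_actions:
        if not guard.approve(cost):
            print(f"Stopping: '{description}' (${cost}) would exceed the ${guard.cap_usd} cap. Checking with the user.")
            break
        print(f"Executing: {description} (${cost}); spent so far: ${guard.spent_usd}")

run_agent([("buy ad slot", 40.0), ("order inventory", 80.0)], BudgetGuard(cap_usd=100.0))
```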
Will all of that matter? I don’t know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own follow-up post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (the outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
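As a rough illustration (a toy stand-in, not Pinecone or any specific library), the episodic-memory piece amounts to: save text snippets, embed them as vectors, and retrieve the most similar one for the current query. A real system would use learned embeddings; a bag-of-words vector stands in here to keep the example self-contained.

```python
# Toy vector-search "episodic memory": embed saved notes, retrieve the closest match.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Crude stand-in for an embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

saved_notes = [
    "User prefers weekly summaries of spending on ads",
    "Supplier B was late on the last two shoe orders",
    "Goal: make money selling shoes without harming anyone",
]
memory = [(note, embed(note)) for note in saved_notes]

def recall(query: str) -> str:
    """Return the saved note most similar to the query."""
    q = embed(query)
    return max(memory, key=lambda item: cosine(q, item[1]))[0]

print(recall("which supplier has been unreliable with shoe orders"))
```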
[1] I did a little informal testing, asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would have taken me a while to find. It looked like the ethical/capitalist reasoning of a pretty intelligent, and fairly ethical, person.
Excellent story.
The timeline for advances in a year is pretty fast, but on the other hand, it’s not clear that we actually need all of the advances you describe.
It continually baffles me that people can look at LLMs, which have an equivalent IQ of 140 on most questions, and say "but surely there's no way to use that intelligence to make it agentic or to make it teach itself..."
He’s saying all the right things. Call me a hopeless optimist, but I tend to believe he’s sincere in his concern for the existential risks of misalignment.
I’m not sure I agree with him on the short timelines to prevent overhang logic, and he’s clearly biased there, but I’m also not sure he’s wrong. It depends on how much we could govern progress, and that is a very complex issue.
Since you didn’t summarize the argument in that essay, I went and skimmed it. I’d love to not believe the orthogonality thesis.
I found no argument. The content was "the orthogonality thesis isn't necessarily true." But he did accept a "wide angle" version of the thesis, which seems like it would be plenty for standard doom stories. "Human goals aren't orthogonal" was the closest thing to evidence. That's true, but evolution carefully gave us our goals/values to align us with each other.
The bulk was an explicit explanation of the emotional pulls that made him want to disbelieve the orthogonality thesis, and he visibly doesn't grapple with the actual argument.
At the core, this is a reminder to not publish things that will help more with capabilities than alignment. That’s perfectly reasonable.
The tone of the post suggests erring on the side of “safety” by not publishing things that have an uncertain safety/capabilities balance. I hope that wasn’t the intent.
Because that does not make sense. Anything that advances alignment more than capabilities in expectation should be published.
You have to make a difficult judgment call for each publication. Be mindful of your bias in wanting to publish to show off your work and ideas. Get others’ insights if you can do so reasonably quickly.
But at the end of the day, you have to make that judgment call. There’s no consolation prize for saying “at least I didn’t make the world end faster”. If you’re a utilitarian, winning the future is the only goal.
(If you’re not a utilitarian, you might actually want a resolution faster so you and your loved ones have higher odds of surviving into the far future.)
Okay, I’ll try to steelman the argument. Some of this comes from OpenAI and Altman’s posts; some of it is my addition.
Allowing additional compute overhang increases the likely speed of takeoff. If AGI through LLMs is possible, and that isn't discovered for another 5 years, it might be achieved on the first go, with no public discussion and little alignment effort.
LLMs might be the most-alignable form of AGI. They are inherently oracles, and cognitive architectures made from them have the huge advantage of natural language alignment and vastly better interpretability than other deep network approaches. I’ve written about this in Capabilities and alignment of LLM cognitive architectures. I’m eager to learn I’m wrong, but in the meantime I actually think (for reasons spelled out there and to be elaborated in future posts) that pushing capabilities of LLMs and cognitive architectures is our best hope for achieving alignment, even if that speeds up timelines. Under this logic, slowing down LLM progress would be dangerous, as other approaches like RL agents would pass them by before appearing dangerous.
Edit: so in sum, I think their logic is obviously self-serving, but actually pretty solid when it’s steelmanned. I intend to keep pushing this discussion in future posts.
The ITT does something different and worthwhile: it establishes goodwill.
If you care not just about the truth but about me, I will engage with you differently and very likely in a better way to cooperatively work toward truth.
I keep meaning to write “an overlooked goddam basic of rationalist discourse: be fucking nice”.
I think rationalists overlook the massive improvements in discussion you get when people are actively cooperating rather than feeling antipathy.
This is silly and beautiful and profound. I have never heard rationalist music before, and I find it quite moving to hear it for the first time. Several songs brought tears to my eyes (although I've practiced opening those emotional channels a bit, so this is a less uncommon experience for me than for most).
I think this says something about the potential of AI to democratize art and allow high quality art aimed at small minority subgroups.
I want more. Thank you to all of those who made this happen.