I did computational cognitive neuroscience research from completing my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making. I’ve focused on the emergent interactions needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
Agentized LLMs will change the alignment landscape
Capabilities and alignment of LLM cognitive architectures
Shane Legg interview on alignment
Internal independent review for language model agent alignment
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns
AI scares and changing public beliefs
Seriously, what? I’m missing something critical. Under the stated rules as I understand them, I don’t see why anyone would punish another player for reducing their dial.
You state that 99 is a Nash equilibrium, but this just makes no sense to me. Is the key that you’re stipulating that everyone must play as though everyone else is out to make it as bad as possible for them? That sounds like an incredibly irrational strategy.
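To make my confusion concrete, here’s a minimal sketch. The payoff rules here are my guess at the game, not the post’s exact specification: if each player’s payoff simply decreases with the average dial setting, then unilaterally lowering your own dial strictly helps you, so all-99 can’t be a Nash equilibrium — unless a punishment for deviating is explicitly built into the payoffs.

```python
# Hypothetical payoff: every player dislikes heat, so payoff = -average(dials).
# These rules are my reconstruction of the game, not the original post's.

def payoff(dials, i):
    """Player i's payoff under the no-punishment rules: just -average."""
    return -sum(dials) / len(dials)

everyone_99 = [99] * 5
deviate = [30] + [99] * 4  # player 0 unilaterally turns their dial down

# Under these rules, deviating strictly improves player 0's payoff,
# so "everyone plays 99" is NOT a Nash equilibrium.
assert payoff(deviate, 0) > payoff(everyone_99, 0)

# Now stipulate a punishment clause: anyone whose dial is below the
# maximum of the others' dials pays a large penalty. (Again, hypothetical.)
def payoff_with_punishment(dials, i):
    others_max = max(d for j, d in enumerate(dials) if j != i)
    penalty = 1000 if dials[i] < others_max else 0
    return -sum(dials) / len(dials) - penalty

# Only with punishment stipulated does deviating hurt, making all-99 stable.
assert payoff_with_punishment(deviate, 0) < payoff_with_punishment(everyone_99, 0)
```

So the all-99 equilibrium seems to require the punishment stipulation; without it, turning your dial down is a strict improvement — which is exactly what I don’t understand about the stated rules.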
Other communities should be moving to AF style publication, not the other way around. This is how science should be communicated; it has all the virtues of peer review without the massive downsides.
I just moved from neuroscience to publishing on LessWrong. The publishing structure here is, on the whole, far superior to a journal’s. Waiting for peer review instead of getting it in comments is an insane slowdown in the exchange of ideas.
Journal articles are discussed by experts in private. Blog posts are discussed in public in the comments. The difference in amount of analysis shared per amount of time is massive.
Issues like mathematical or other rigor are separate issues. Having tags and other sorting systems to distinguish long and rigorous work from quick writeups of simple ideas, points, and results would allow the best of both worlds.
Furthermore, we have known this for some time. In about 2003 exactly this type of publishing was suggested for neuroscience, for the above reasons—and as a way to give credit for review work. Neuroscience won’t switch to it because of cultural lock-in. Don’t give up your great good fortune in not being stuck in an antique system.
Goals selected from learned knowledge: an alternative to RL alignment
I thought he just meant “criticism is good, actually; I like having it done to me so I’m going to do it to you”, and was saying that rationalists tend to feel this way.
Welp, that looks like one central crux right there:
No need to worry about creating “zombie” forms of higher intelligence, as these will be at a thermodynamic/evolutionary disadvantage compared to conscious/higher-level forms of intelligence
I think the most important thing to note is that this hasn’t been part of enough discussions to make it into Zvi’s summary. What is happening sounds like the worst sort of polarization. Ad hominem attacks create so much mutual irritation that the discussion bogs down entirely. This effect is surprisingly powerful. I think polarization is the mind-killer.
On the object level, I think that question is interesting, and not clear-cut.
You’re assuming the two alternatives are that everything she’s said is true and accurate, or else nothing is. It does not require psychosis to make wrong interpretations or to have mild paranoia. It merely requires not being a dedicated rationalist, and/or having a hard life. I’m pretty sure that being abused would help cause paranoia, making it likelier she’d get some things wrong.
Unfortunately, it’s going to be impossible to disentangle this without more specific evidence. Psychology is complicated. Both real recovered memories and fabricated memories seem to be common.
You didn’t estimate the base rate of sexual abuse by siblings. While that’s very hard to figure out, it’s very likely in the same neighborhood as your 1-3% rate for psychosis, and it’s even harder to study or estimate. So this isn’t going to help much in resolving the issue.
Upvoted for making well-argued and clear points.
I think what you’ve accomplished here is eating away at the edges of the AGI x-risk argument. I think you argue successfully for longer timelines and a lower P(doom). Those timelines and estimates are shared by many of us who are still very worried about AGI x-risk.
Your arguments don’t seem to address the core of the AGI x-risk argument.
You’ve argued against many particular doom scenarios, but you have not presented a scenario that includes our long term survival. Sure, if alignment turns out to be easy we’ll survive; but I only see strong arguments that it’s not impossible. I agree, and I think we have a chance; but it’s just a chance, not success by default.
I like this statement of the AGI x-risk arguments. It’s my attempt to put the standard arguments of instrumental convergence and capabilities in common language:
Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you’re not going to get your way. If it doesn’t care about you even a little, and it continues to become more capable faster than you do, you’ll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous, or simply outcompeted doesn’t matter. It could take a long time, but if you miss the window of having control over the situation, you’ll still wind up dead.
This could of course be expanded on ad infinitum, but that’s the core argument, and nothing you’ve said (on my quick read, sorry if I’ve missed it) addresses any of those points.
There were (I’ve been told) nine other hominin species. They are all dead. The baseline outcome of creating something smarter than you is that you are outcompeted and ultimately die out. Assuming survival by default seems based on optimism, not reason.
So I agree that P(doom) is less than 99%, but I think the risk is still very much high enough to devote way more resources and caution than we are now.
Some more specific points:
Fanatical maximization isn’t necessary for doom. An agent with any goal still invokes instrumental convergence. It can be as slow, lazy, and incompetent as you like. The only question is whether it can outcompete you in the long run.
Humans are somewhat safe (but think about the nuclear standoff; I don’t think we’re even self-aligned in the medium term). But there are two reasons for that. First, humans can’t self-improve very well, while AGI has many more routes to recursive self-improvement. On the roughly level human playing field, cooperation is the rational policy; in a scenario where you can focus on self-improvement, cooperation doesn’t make sense long-term. Second, humans have a great deal of evolution making our instincts guide us toward cooperation. AGI will not have that unless we build it in, and we have only very vague ideas of how to do that.
Loose initial alignment is way easier than a long-term stable alignment. Existing alignment work barely addresses long-term stability.
A balance of power in favor of aligned AGI is tricky. Defending against misaligned AGI is really difficult.
Thanks so much for engaging seriously with the ideas, and putting time and care into communicating clearly!
I think the structure of Alignment Forum vs. academic journals solves a surprising number of the problems you mention. It creates a different structure for both publication and prestige. More on this at the end.
It was kind of cathartic to read this. I’ve spent some time thinking about the inefficiencies of academia, but hadn’t put together a theory this crisp. My 23 years in academic cognitive psychology and cognitive neuroscience would have been insanely frustrating if I hadn’t been working on lab funding. I resolved going in that I wasn’t going to play the publish-or-perish game and jump through a bunch of strange hoops to do what would be publicly regarded as “good work”.
I think this is a good high-level theory of what’s wrong with academia. I think one problem is that academic fields don’t have a mandate to produce useful progress, just progress. It’s a matter of inmates running the asylum. This all makes some sense, since the routes to making useful progress aren’t obvious, and non-experts shouldn’t be directly in charge of the directions of scientific progress; but there’s clearly something missing when no one along the line has more than a passing motivation to select problems for impact.
Around 2006 I heard Tal Yarkoni, a brilliant young scientist, give a talk on the structural problems of science and its publication model. (He’s now an ex-scientist, as many brilliant young scientists become these days.) The changes he advocated were almost precisely the publication and prestige model of the Alignment Forum. It allows publications of any length and format, and provides a public time stamp for when ideas were contributed and developed. It also provides a public record, in the form of karma scores, of how valuable the scientific community found that publication. This only works in a closed community of experts, which is why I’m mentioning AF and not LW. One’s karma score is publicly visible as a sum total of community appreciation of that person’s work.
This public record of appreciation breaks an important deadlocking incentive structure in the traditional scientific publication model: if you’re going to find fault with a prominent theory, your publication had better be damned good (or rather “good” by the vague aesthetic judgments you discuss). Otherwise you’ve just earned a negative valence from everyone who likes that theory and/or the people who have advocated it, with little to show for it. I think that’s why there’s little market for the type of analysis you mention, in which someone goes through the literature in painstaking detail to resolve a controversy, and then finds no publication outlet for their hard work.
This is all downstream of the current scientific model, which is roughly an advocacy model. As in law, it’s considered good and proper to vigorously advocate for a theory even if you don’t personally think it’s likely to be true. This might make sense in law, but in academia it’s the reason we sometimes say that science advances one funeral at a time. The effect of motivated reasoning combined with the advocacy norm causes scientists to advocate their favorite wrong theory unto their deathbed, and be lauded by most of their peers for doing so.
The rationalist stance of asking that people demonstrate their worth by changing their mind in the face of new evidence is present in science, but it seemed to me much less common than the advocacy norm. This rationalist norm provides partial resistance to the effects of motivated reasoning. That is worth its own post, but I’m not sure I’ll get around to writing it before the singularity.
These are all reasons that the best science is often done outside of academia.
Anyway, nice thought-provoking article.
This is much better than any of his other speaking appearances. The short format, and TED’s excellent talk editing/coaching, have really helped.
This is still terrible.
I thought it was a TEDx talk, and I thought it was perhaps the worst TEDx talk I’ve seen. (I agree that it’s rare to see a TEDx talk with good content, but the deliveries are usually vastly better than this).
I love Eliezer Yudkowsky. He is the reason I’m in this field, and I think he’s one of the smartest human beings alive. He is also one of the best-intentioned people I know. This is not a critique of Yudkowsky as an individual.
He is not a good public speaker.
I’m afraid having him as the public face of the movement is going to be devastating. The reactions I see to his public statements indicate that he is creating polarization. His approach makes people want to find reasons to disagree with him. And individuals motivated to do that will follow their confirmation bias to focus on counterarguments.
I realize that he had only a few days to prepare this. That is not the problem. The problem is a lack of public communication skills. Those are very different from communicating with your in-group.
Yudkowsky should either level up his skills, rapidly, or step aside.
There are many others with more talent and skills for this type of communication.
Eliezer is rapidly creating polarization around this issue, and that is very difficult to undo. We don’t have time to do that.
Could we bull through with this approach, and rely on the strength of the arguments to win over public opinion? That might work. But doing that instead of actually thinking about strategy and developing skills would hurt our odds of survival, perhaps rather badly.
I’ve been afraid to say this in this community. I think it needs to be said.
We have promising alignment plans with low taxes
Regulation and complexity of effects seem like another two big blockers.
Effects of genes are complex. Knowing a gene is involved in intelligence doesn’t tell us what it does and what other effects it has.
I wouldn’t accept any edits to my genome without the consequences being very well understood (or in a last-ditch effort to save my life). I’d predict severe mental illness would happen alongside substantial intelligence gains.
Source: research career as a computational cognitive neuroscientist.
I put this as a post-ASI technology, but that’s also a product of my relatively short timelines.
Downvote for being absurdly overconfident, and thereby harming the whole direction of more optimism on alignment. I’d downvote Eliezer for the same reason on his 99.99% doom arguments in public; they are visibly silly, making the whole direction seem silly by association.
In both cases, there are too many unknown unknowns to have confidences remotely that high. And you’ve added way more silly zeros than EY, despite having looser arguments.
This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it’s just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.
So, I look forward to your next iteration, improved by the feedback on this post!
I’m not sure what the takeaway is here, but these calculations are highly suspect. What a memory athlete can memorize (in their domain of expertise) in 5 minutes is an intricate mix of working memory, long-term semantic memory, and episodic (hippocampal) memory.
This is a very deep topic. Reading comprehension researchers have estimated the size of working memory as “unlimited”, but that’s obviously specific to their methods of measurement.
Modern debates put working memory capacity at 1-4 items. The classic 7 was specific to what is now known as the phonological loop: subvocally reciting digits. The strong learned connections between auditory cortex and verbal motor areas give this a slight advantage over working memory for material that hasn’t been specifically practiced a lot.
See the concept of exformation, incidentally from one of the best books I’ve found on consciousness. The bits of information encoded by a signal to a sophisticated system are intricately intermixed with that system’s prior learning. It’s a type of compression. Not making a call at a specific time can encode a specific signal of unlimited length, if sender and receiver agree to that meaning.
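The phone-call example can be sketched as a toy model (the codebook contents here are entirely invented for illustration): if sender and receiver pre-agree on a shared codebook, a single bit on the channel — call or no call at the agreed time — selects an arbitrarily long message.

```python
# Toy model of exformation: the channel carries one bit, but the shared
# codebook (the receiver's prior learning) lets that bit select an
# arbitrarily long message. Codebook contents are invented for illustration.

CODEBOOK = {
    True:  "The deal is on; proceed exactly as we planned last week.",
    False: "Something went wrong; abort everything and destroy the notes.",
}

def send(called_at_agreed_time: bool) -> bool:
    # All that crosses the channel is one bit: did the call happen?
    return called_at_agreed_time

def receive(bit: bool) -> str:
    # The receiver's prior knowledge does the real "decompression."
    return CODEBOOK[bit]

message = receive(send(False))
# The single bit "no call" decodes to a full, specific instruction.
```

The point is that the information “in” the signal can’t be measured from the channel alone; almost all of it lives in the shared context, which is why counting bits memorized per minute is so slippery.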
Sorry for the lack of citations. I’ve had my head pretty deeply into this stuff in the past, but I never saw the importance of getting a precise working memory capacity estimate. The brain mechanisms are somewhat more interesting to me, but for different reasons than estimating capacity (they’re linked to goals and reward system operation, since working memory for goals and strategy is probably how we direct behavior in the short term).
This is silly and beautiful and profound. I have never heard rationalist music before, and I find it quite moving to hear it for the first time. Several songs brought tears to my eyes (although I’ve practiced opening those emotional channels a bit, so this is a less uncommon experience for me than for most).
I think this says something about the potential of AI to democratize art and allow high quality art aimed at small minority subgroups.
I want more. Thank you to all of those who made this happen.