I just want to note that humans aren’t aligned by default, so creating human-like reasoning and learning is not itself an alignment method. It’s just a different variant of providing capabilities, which you separately need to point at an alignment target.
It may or may not be easier to align than alternatives. I personally don’t think this matters, because I strongly believe that the only type of AGI worth aligning is the type(s) most likely to be developed first. Hoping that industry and society will make major changes to AGI development based on which types researchers think are easier to align seems like a forlorn hope.
More on why it’s a mistake to assume human-like cognition in itself leads to alignment:
Sociopaths/psychopaths are a particularly vivid example of how humans are misaligned. And there are good reasons to think that they are not a special case in which empathy was accidentally left out or deliberately blocked, but rather that they are baseline human cognition without the mechanisms that create empathy. It’s hard to make this case with certainty, but it would be a very bad idea to assume that humans are aligned by default, and that all we have to do is reproduce human-like cognitive mechanisms and then train the result “in a good family” or something similar.
That’s not to argue that human-like approaches to AGI are worse for alignment, just to say that they’re only better insofar as we have a somewhat better understanding of that type of cognition, and of some of the mechanisms by which humans often wind up approximately aligned in common contexts.
My own research is also in using loosely human-like reasoning and learning as a route to alignment, but that’s primarily because a) that’s my background expertise, so it’s my relative advantage, and b) I think LLMs are very loosely like some parts of the human brain/mind, and that we’ll see continued expansion of LLM agents to reason in more loosely human-like ways (that is, with chains of thought, specific memory lookups, metacognition to organize all this, etc.).
So I’m working on aligning loosely human-like cognition not because I think it’s by default any easier than aligning any other form of AGI, but simply because that’s what seems most likely to become the first takeover-capable (or pivotal-act-capable) AGI.
Yes, it could be that “special, inherently more alignable cognition” doesn’t exist or can’t be discovered by mere mortal humans. It could be that human-like reasoning isn’t inherently more alignable. Finally, it could be that we can’t afford to study it because the dominant paradigm is different. Also, I realize that glass-box AI is a pipe dream.
Wrt sociopaths/psychopaths: I’m approaching this from a more theoretical standpoint. If I knew a method of building a psychopath AI (one that cares about something selfish, e.g. gaining money, fame, social power, new knowledge, or even paperclips) and knew the core reasons why it works, I would consider that major progress, because it would solve many alignment subproblems, such as ontology identification and subsystem alignment.