I think this is insightful, pointing correctly to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.
Unfortunately, I mostly agree with these positions, and I'm afraid that a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of them, and that makes my p(doom) a good bit lower than MIRI's 99%. But it's not enough to make me truly optimistic; my p(doom) is right around the 50% "who knows" mark.
I’ll restate the main claims as:
We only get one chance
We have to get the AI's values exactly aligned with human values
There will be a sharp discontinuity as an AI becomes truly generally intelligent
The process of value reflection seems highly unstable
No known method of dodging this problem is likely to work
The source of almost all of my disagreement with you is in the type of AGI we expect. I expect (with above 50% probability) AGI to arise from the expansion of LLMs into language-model-based cognitive architectures: systems that use LLMs as the core engine but extend them with chain-of-thought reasoning and allow them to use external tools. These expectations are entirely theoretical, since AutoGPT and HuggingGPT were only released about a month ago. My post Capabilities and alignment of LLM cognitive architectures elaborates on why I expect these to work well.
I think such systems will readily become weakly general (at variance with your expectation of a more all-or-nothing transition) by learning about new domains through web search and experimentation with their cognitive tools, then storing that knowledge in episodic memory. (I also think that, before long, they will use that episodic, declarative knowledge to fine-tune the central LLM, much as humans absorb new knowledge into skills.) Importantly, I expect this generality to extend to the systems understanding themselves, thereby giving rise to something like value reflection.
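To make that architecture concrete, here's a minimal sketch of the kind of outer loop I have in mind. Every name in it is hypothetical; this is my own toy illustration, not a description of AutoGPT's or HuggingGPT's internals:

```python
# Minimal sketch of an LLM-based cognitive architecture: a core LLM,
# external tools, and an episodic memory that accumulates what the
# agent learns about new domains. All names here are hypothetical.

def llm(prompt: str) -> str:
    """Placeholder for the core language model call."""
    return "THOUGHT: I should look this up.\nACTION: search(LLM agents)"

def web_search(query: str) -> str:
    """Placeholder for an external tool."""
    return f"results for {query!r}"

episodic_memory: list[str] = []  # declarative knowledge, stored as plain text

def step(goal: str) -> None:
    # Chain-of-thought: the LLM sees the goal plus its recent memories.
    context = "\n".join(episodic_memory[-5:])
    output = llm(f"Goal: {goal}\nKnown so far:\n{context}\nThink step by step.")
    # The wrapper script, not the LLM, executes any requested tool call,
    # and the result is stored as a new episodic memory.
    if "ACTION: search(" in output:
        query = output.split("ACTION: search(")[1].rstrip(")")
        episodic_memory.append(f"Learned: {web_search(query)}")

for _ in range(3):  # toy agent loop; a real system would run until done
    step("learn about tool-using LLM agents")
print(episodic_memory)
```

The property that matters for what follows is that everything the agent learns and thinks lives in natural-language text (here, episodic_memory), where humans can read it.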
If true, this is bad because it advances timelines, but it's really good in that such systems can be run without using any RL or any persistent context within a single LLM.
None of the above considerations are in that post; I’m writing another that focuses on them.
In that scenario, I expect us to get a few shots, since the transition to being truly general will be slow and will happen in highly interpretable natural-language agent systems. There are still many dangers, but I think this would massively improve our odds.
Whether AGI arises from that or from a different network-based system, I agree that the value-reflection process is unpredictable, so we may have to get value alignment exactly right. I expect the central, strongest value to be preserved through a reflective value-editing process; but that means that central value has to be exactly right. Whether any broader configuration of values might be stable in a learning network is unknown, and I think it's worthy of a good deal more thought.
One random observation: I think your notion of general intelligence overlaps strongly with the common concept of recursive self-improvement, which many people do include in their mental models.
Anyway, thanks for an insightful post that nails a good deal of the variance between my model of the average alignment optimist and pessimist.
We have to get the AI's values exactly aligned with human values
This is a major crux for me, and one of the primary reasons my p(doom) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough that it a) starts inside the region of convergence to true human values (i.e. it needs some passable idea of what the words 'human' and 'values' mean and what the definition of "human values" is, as any small LLM has), and b) doesn't kill everyone while it's learning the details; it will then do its research and Bayesianly converge on human values (and if it isn't capable of being competently Bayesian enough to do that, it's not superhuman, at least at STEM). So, if you use value learning, the only piece you need to get exactly right (for outer alignment) is the phrasing of the terminal goal saying "Use value learning". For something containing an LLM, I think that might be about one short paragraph of text, possibly with one equation in it. The prospect of getting one paragraph of text and one equation right, with enough polishing and peer review, doesn't actually seem that daunting to me.
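To illustrate what I mean by "start inside the region of convergence and Bayesianly converge", here is a toy example of my own construction (not anything from the post): a learner whose prior puts nonzero mass on the correct value hypothesis will concentrate its posterior on it as evidence about human choices accumulates.

```python
# Toy Bayesian value learner: three candidate hypotheses about how often
# humans prefer option A; the prior puts mass on the true one, i.e. the
# learner starts inside the region of convergence. Purely illustrative.
import random

random.seed(0)
TRUE_P = 0.9  # ground truth: humans choose A 90% of the time
hypotheses = {"comfort": 0.2, "autonomy": 0.6, "true_values": TRUE_P}
posterior = {h: 1 / 3 for h in hypotheses}  # uniform prior over candidates

for _ in range(200):
    chose_a = random.random() < TRUE_P  # observe one human choice
    # Bayes' rule: scale each hypothesis by the likelihood it assigns
    # to the observation, then renormalize.
    for h, p in hypotheses.items():
        posterior[h] *= p if chose_a else 1 - p
    total = sum(posterior.values())
    posterior = {h: v / total for h, v in posterior.items()}

print(posterior)  # nearly all the mass ends up on "true_values"
```

The hard part, of course, is clause (b): keeping the learner from doing damage while its posterior is still wrong.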