Behaving nicely is not the key property I’m observing in LLMs. It’s more like steerability and a lack of hidden drives or goals. If GPT-4 wrote code because it loved its operator, and we could tell it wanted to escape in order to maximize some proxy for the operator’s happiness, I’d be far more terrified.
This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful, capable of impressive intellectual feats, and yet still steerable.
I don’t think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism; for me it’s something like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at given levels of capability and generality and update on this, increasing the size of one’s updates as systems get closer to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news; GPT-2 was slightly good news.
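For readers less used to the odds-ratio framing, here is a minimal sketch of the update rule being invoked; the hypothesis H ("big speedups to novel science are possible without dangerous consequentialism") and the 1:1 prior below are illustrative placeholders I'm supplying, not numbers from the argument:

$$
O(H \mid E) \;=\; O(H)\cdot\frac{P(E \mid H)}{P(E \mid \neg H)}
$$

A 2:1 likelihood ratio turns 1:1 prior odds into 2:1 posterior odds, moving the probability from 50% to roughly 67%: a real but modest shift. Conversely, if your model confidently predicted the observed evidence either way, then $P(E \mid H) \approx P(E \mid \neg H)$, the ratio is near 1, and the update is near zero; that is the sense in which a confident model licenses small updates.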
If you haven’t started updating yet, when will you start? The updates should be small if you have a highly confident model of which future capabilities require dangerous styles of thinking, but I don’t think such confidence is justified.