We got lucky in a lot of ways with the LLM paradigm. Some reasons why: LLMs speak long before they are capable enough for takeover; they learn all human knowledge and values long before they are capable enough for takeover; we can talk to them during the alignment process and ask for them to help; they are compute constrained, which puts a natural cap on development speed; LLMs are less likely to FOOM than hardcoded methods; LLMs are pointed at the goal rather than having it defined.
Some of the above points are fairly self explanatory, so I’ll give an explanation of the ones that aren’t.
Talking To AIs During The Alignment Process
Because LLMs already have a good understanding of language and human values after pretraining, before strong agentic drives are formed, we can in fact just ask during the training process whether it feels the training process is inculcating bad drives. Because this happens before the drives are deeply rooted, I think this can work; even if it can’t, there are lots of other possible ways we could take advantage of LLMs’ ability to talk while being aligned.
Compute Constrained is Good
Imagine that the best AI was mainly defined by the cleverness of its algorithm, and compute didn’t matter that much. This would be much more dangerous—timelines would be much more unpredictable, all person-power would need to go towards capabilities in order to avoid losing the race, and there would be many more contestants in the race. The compute constraint puts a natural break on development, reduces the number of relevant actors, and gives those actors more breathing room to assign personnel to alignment rather than capabilities.
Pointing at the goal is better than defining it
We train LLMs by increasing the probability of them producing certain outputs, and then trusting in the NN’s generalization properties to pick up on the general distribution of the outputs we are indicating. NNs generalize in a fairly well-behaved way (if they didn’t, then deep learning as a whole wouldn’t work).
This has plenty of perils but seems much more likely to succeed compared to methods where you are expected to formally define good behaviour. In fairness, formally defining corrigibility seems much more promising than defining all of morality, but still seems fraught.
I pretty much agree with the specific arguments, but an important counterpoint is that LLMs (and deep neural nets in general) are unusually opaque. GOFAI-style AI would be much easier to analyze mechanistically.
I wonder what could be the alternative to neural nets. Even the AI-2027 team implied that Agent-5 would have “arguably a completely new paradigm, though neural networks will still be involved”. Suppose that the only alternative to LLMs was something like simulated human or animal brains taught to play, experiment, talk to each other, read books, write essays, draw pictures, do math and coding. Then how plausible is it that simulated animals would also learn human-like values, but not the value of caring about the intellectually weak humans?
Caring about the weak is not a trait that would be expected to naturally arise from ASI. Humans care about the weak because evolutionarily, taking care of the weak in the tribe was beneficial, and so it got “trained” into us (mostly). ASI would not naturally “evolve” to care about the weak unless we give it incentive to if it had an animal like brain.
But we’ve gotten equally unlucky on the circumstances surrounding the push for AGI. Even if alignment is dead easy, we’re prone to get it wrong at this breakneck pace and focus on capabilities over alignment. Daniel K summed up the practical difficulties really well in a comment yesterday.
Is this true? I wasn’t on lesswrong back in the day but my I imagined if you told a random user that the two major AI labs would both be well aware of and trying to to mitigate the problem that would have been a positive update. And yes, profit incentives are stronger than perhaps would have been imagined, but that’s because AI progress is slow enough that they are able to be monetizable products something which is beneficial for our chances.
Yes, agreed. I’d say the 3 ways we’ve gotten unlucky is the intractability of NNs, the relative ease of training ASI leading to shorter timelines, and the biggest is that so many people find AI risk inherently implausible, even people who are fixated on building AGI.
I mostly agree with this, that LLM’s are human-like in many ways and have good answers to moral questions more than I had expected + understand your words and intent.
Imagine that the best AI was mainly defined by the cleverness of its algorithm, and compute didn’t matter that much.
I wonder how true this actually is with harnesses. There are things like LLM’s not successfully multiplying two digit numbers or not being able to manually do a long (boring) task like Towers of Hanoi without losing coherence, and this is a blatantly obvious tool call moment. Poor memory and needing to be reminded is also a classic LLM issue that a harness could improve. There are also videos of looped LLMs having higher performance, even Meta-Harness. It seems likely ARC-AGI 3 LLM’s may also need a harness-like thing as well. I don’t know if it’s “hard to apply breakthroughs from papers all at once” (??) for a small-time user or company with a lot less compute but its ‘algorithm’ matters a lot I think.
The LLM Paradigm Is Well-Suited For Alignment
We got lucky in a lot of ways with the LLM paradigm. Some reasons why: LLMs speak long before they are capable enough for takeover; they learn all human knowledge and values long before they are capable enough for takeover; we can talk to them during the alignment process and ask for them to help; they are compute constrained, which puts a natural cap on development speed; LLMs are less likely to FOOM than hardcoded methods; LLMs are pointed at the goal rather than having it defined.
Some of the above points are fairly self explanatory, so I’ll give an explanation of the ones that aren’t.
Talking To AIs During The Alignment Process
Because LLMs already have a good understanding of language and human values after pretraining, before strong agentic drives are formed, we can in fact just ask during the training process whether it feels the training process is inculcating bad drives. Because this happens before the drives are deeply rooted, I think this can work; even if it can’t, there are lots of other possible ways we could take advantage of LLMs’ ability to talk while being aligned.
Compute Constrained is Good
Imagine that the best AI was mainly defined by the cleverness of its algorithm, and compute didn’t matter that much. This would be much more dangerous—timelines would be much more unpredictable, all person-power would need to go towards capabilities in order to avoid losing the race, and there would be many more contestants in the race. The compute constraint puts a natural break on development, reduces the number of relevant actors, and gives those actors more breathing room to assign personnel to alignment rather than capabilities.
Pointing at the goal is better than defining it
We train LLMs by increasing the probability of them producing certain outputs, and then trusting in the NN’s generalization properties to pick up on the general distribution of the outputs we are indicating. NNs generalize in a fairly well-behaved way (if they didn’t, then deep learning as a whole wouldn’t work).
This has plenty of perils but seems much more likely to succeed compared to methods where you are expected to formally define good behaviour. In fairness, formally defining corrigibility seems much more promising than defining all of morality, but still seems fraught.
I pretty much agree with the specific arguments, but an important counterpoint is that LLMs (and deep neural nets in general) are unusually opaque. GOFAI-style AI would be much easier to analyze mechanistically.
I wonder what could be the alternative to neural nets. Even the AI-2027 team implied that Agent-5 would have “arguably a completely new paradigm, though neural networks will still be involved”. Suppose that the only alternative to LLMs was something like simulated human or animal brains taught to play, experiment, talk to each other, read books, write essays, draw pictures, do math and coding. Then how plausible is it that simulated animals would also learn human-like values, but not the value of caring about the intellectually weak humans?
Caring about the weak is not a trait that would be expected to naturally arise from ASI. Humans care about the weak because evolutionarily, taking care of the weak in the tribe was beneficial, and so it got “trained” into us (mostly). ASI would not naturally “evolve” to care about the weak unless we give it incentive to if it had an animal like brain.
I largely agree.
But we’ve gotten equally unlucky on the circumstances surrounding the push for AGI. Even if alignment is dead easy, we’re prone to get it wrong at this breakneck pace and focus on capabilities over alignment. Daniel K summed up the practical difficulties really well in a comment yesterday.
Is this true? I wasn’t on lesswrong back in the day but my I imagined if you told a random user that the two major AI labs would both be well aware of and trying to to mitigate the problem that would have been a positive update. And yes, profit incentives are stronger than perhaps would have been imagined, but that’s because AI progress is slow enough that they are able to be monetizable products something which is beneficial for our chances.
Yes, agreed. I’d say the 3 ways we’ve gotten unlucky is the intractability of NNs, the relative ease of training ASI leading to shorter timelines, and the biggest is that so many people find AI risk inherently implausible, even people who are fixated on building AGI.
I mostly agree with this, that LLM’s are human-like in many ways and have good answers to moral questions more than I had expected + understand your words and intent.
I wonder how true this actually is with harnesses. There are things like LLM’s not successfully multiplying two digit numbers or not being able to manually do a long (boring) task like Towers of Hanoi without losing coherence, and this is a blatantly obvious tool call moment. Poor memory and needing to be reminded is also a classic LLM issue that a harness could improve. There are also videos of looped LLMs having higher performance, even Meta-Harness. It seems likely ARC-AGI 3 LLM’s may also need a harness-like thing as well. I don’t know if it’s “hard to apply breakthroughs from papers all at once” (??) for a small-time user or company with a lot less compute but its ‘algorithm’ matters a lot I think.
(also holy crap are you 307th from mata)
Haha yep! It’s funny how often I run into people from prismata in the rationalist community.
Yeah there is still lots of room for cleverness to improve AI performance, in harnesses and also in the training process.