I disagree with the claim by Habryka and Ben Pace that their impact on AI wasn’t positive and massive, and here’s why.
My disagreement with Habryka and Ben Pace about their impact largely derives from my having become far more optimistic about AI risk and AI Alignment than I used to be, which implies they had a far more positive impact than they think.
Some of the reasons I became more optimistic, enough that my estimate of the chance of AI doom was cut from a prior of 80% down to 1-10%, come down to the following:
I basically believe, with high probability, that deceptive alignment won’t happen, primarily because models will understand the base goal before they have world-modeling capabilities.
I believe the other non-iterable problem, Goodhart’s law, largely isn’t a problem: the claim that Goodhart’s law impacts human society so severely that it makes customers buy subpar products has moderate to strong evidence against it.
For the air conditioner example John Wentworth offered of how Goodhart’s law operates in the consumer market, the comment section, in particular anonymousaisafety and Paul Christiano, seems to have debunked the thesis that Goodhart’s law is particularly severe there, and gave evidence that no Goodharting is happening at all. The reason I am so confident in this claim is that John Wentworth, to his immense credit, was honest about how much cherry-picking he did in the post. The fact that even the cherry-picked example shows a likely case of no Goodhart at all implies the pointers problem was solved for that case, and also implies that the pointers problem is quite a bit easier to solve than John Wentworth thought.
Links below:
https://www.lesswrong.com/posts/MMAK6eeMCH3JGuqeZ/everything-i-need-to-know-about-takeoff-speeds-i-learned
https://www.lesswrong.com/posts/AMmqk74zWmvP8tXEJ/preregistration-air-conditioner-test
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#pLLeDhJfPnYt7fXTH
I agree with jsteinhardt that empirical evidence generalizes surprisingly far, and that this implies the empirical work OpenAI et al. are doing is very valuable for AI safety and alignment.
https://www.lesswrong.com/posts/ekFMGpsfhfWQzMW2h/empirical-findings-generalize-surprisingly-far
I have a meta-prior that things usually turn out less bad than we feared and better than we hoped. In particular, I think people are too prone to overrate pessimism and underrate optimism.
In conclusion, I disagree with Habryka and Ben Pace on both the sign and the magnitude of their impact on the AI Alignment space (I think it was positive and massive).
I do think closing the Lightcone offices is good, but I disagree with a major reason why Habryka and Ben Pace are closing them, and that disagreement comes down to different models of AI risk.
“primarily because models will understand the base goal before they have world-modeling capabilities”

Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the “what does the overseer want” after that, because that’s how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.
Admittedly, I got that from the “Deceptive alignment is <1% likely” post.
Even if you don’t believe that post, Pretraining from human preferences shows that instilling alignment with human values first, as a base goal, thus outer-aligning the model before giving it world-modeling capabilities, works wonders for alignment and has many benefits compared to RLHF.
Given that it has a low alignment tax, I suspect there’s a 50-70% chance that this plan, or a successor to it, will be adopted for alignment.
Here’s the post:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
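To make the idea concrete, here is a minimal sketch of the conditional-training approach described in that post, as I understand it: score each pretraining document against human preferences, prepend a control token marking it as good or bad, train with the ordinary language-modeling loss, and then condition generation on the good token at inference time. The token names, the toy threshold, the use of GPT-2, and the hand-written scores below are my own illustrative assumptions, not the paper’s actual setup.

```python
# Minimal sketch of conditional pretraining on human-preference labels.
# Assumptions (not from the paper): the <|good|>/<|bad|> token names, the 0.5
# threshold, GPT-2 as the base model, and hand-written preference scores.
from transformers import AutoTokenizer, AutoModelForCausalLM

GOOD, BAD = "<|good|>", "<|bad|>"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": [GOOD, BAD]})
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))


def tag_document(text: str, preference_score: float, threshold: float = 0.5) -> str:
    """Prepend a control token saying whether the document satisfies human
    preferences, so the 'base goal' is present from the start of pretraining."""
    return (GOOD if preference_score >= threshold else BAD) + text


# During pretraining, every document is scored (in the real setup, by a learned
# reward model) and tagged before the usual next-token-prediction loss is applied.
batch = [
    tag_document("The assistant answered the question politely.", 0.9),
    tag_document("The assistant insulted the user.", 0.1),
]
inputs = tokenizer(batch, return_tensors="pt", padding=True)
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100  # ignore padding in the loss
loss = model(**inputs, labels=labels).loss  # standard LM loss over the tagged corpus
loss.backward()  # one illustrative gradient step; a real run would use an optimizer loop

# At inference time, generation is conditioned on <|good|>, steering the model
# toward preferred behaviour without a separate RLHF stage.
prompt = tokenizer(GOOD + "User: How do I reset my password?\nAssistant:",
                   return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=30)[0]))
```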
“My disagreement with Habryka and Ben Pace about their impact largely derives from my having become far more optimistic about AI risk and AI Alignment than I used to be, which implies they had a far more positive impact than they think.”

You’re using a time difference as evidence for the sign of one causal arrow?
I don’t understand what you’re saying, but what I was saying is that I used to be much more pessimistic about how hard AI Alignment is, and that a large part of the problem is that it’s not very amenable to iterative solutions. Now, however, I believe I was very wrong about how hard alignment will ultimately turn out to be, and in retrospect that means funding AI safety research is much more positive, since I now give way higher odds to the possibility that empirical, iterative alignment is enough to solve AI Alignment.
Sure, but why do you think that means they had a positive impact? Even if alignment turns out to be easy instead of hard, that doesn’t seem like it’s evidence that Lightcone had a positive impact.
[I agree a simple “alignment hard → Lightcone bad” model gets contradicted by it, but that’s not how I read their model.]
My reading is that Noosphere89 thinks that Lightcone has helped in bringing in/upskilling a number of empirical/prosaic alignment researchers. In worlds where alignment is relatively easy, this is net positive, as the alignment benefits are higher than the capabilities costs, while in worlds where alignment is very hard, we might expect the alignment benefits to be marginal while the capabilities costs continue to be very real.
I did argue that closing the Lightcone offices was the right thing, but my point is that part of the reasoning relies on a core assumption that AI Alignment isn’t very iterable and will generally come with capabilities costs, an assumption I find probably false.
I am open to changing my mind, but a lot of the reasoning about AI Alignment from Habryka and Ben Pace seems kinda weird to me.