I suspect that selection effects can be dealt with by easy access to a ground truth. One wouldn’t need to be Einstein to calculate how Mercury’s perihelion would behave according to Newton’s theory. In reality the perihelion rotates at a different rate with no classical reason in sight, so Newton’s theory had to be replaced by something else.
Nutrition studies and psychology studies are likely difficult because they require careful approach to avoid a biased selection of people subjected to the investigation. And social studies are, as their name suggests, supposed to study the evolution of whole societies and find it hard to construct a big data set or to create a control group. In addition, humanities-related fields[1] could also be easier affected by death spirals.
Returning to AI alignment, we don’t have any AGIs yet, we only have philosophical arguments which arguably prevent some alignment methods and/or targets[2] from scaling to the ASI. However, we do have LLMs which we can finetune and whose outputs and chains of thought we can read.
Studying the LLMs already yields worrying results, including the facts that LLMs engage in things like self-preservation, alignment faking or reward hacking.
Nutrition studies are supposed to predict how food consumed by a person affects the person’s health. I doubt that one can use such studies for ideology-related goals or that these studies can be affected by a death spiral.
For what it’s worth, I agree that empirical results have made me worry more relative to last year, and it’s part of the reason I no longer have p(doom) below 1-5%.
But there are other important premises which I don’t think are supported well by empirics, and are arguably load-bearing for the confidence that people have.
One useful example from Paul Christiano is there’s a conflation between solving the alignment problem on the first critical try, and not being able to experiment at all, and while this makes AI governance way harder, it doesn’t make the science problem nearly as difficult:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer.. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.
I suspect that selection effects can be dealt with by easy access to a ground truth. One wouldn’t need to be Einstein to calculate how Mercury’s perihelion would behave according to Newton’s theory. In reality the perihelion rotates at a different rate with no classical reason in sight, so Newton’s theory had to be replaced by something else.
Nutrition studies and psychology studies are likely difficult because they require careful approach to avoid a biased selection of people subjected to the investigation. And social studies are, as their name suggests, supposed to study the evolution of whole societies and find it hard to construct a big data set or to create a control group. In addition, humanities-related fields[1] could also be easier affected by death spirals.
Returning to AI alignment, we don’t have any AGIs yet, we only have philosophical arguments which arguably prevent some alignment methods and/or targets[2] from scaling to the ASI. However, we do have LLMs which we can finetune and whose outputs and chains of thought we can read.
Studying the LLMs already yields worrying results, including the facts that LLMs engage in things like self-preservation, alignment faking or reward hacking.
Nutrition studies are supposed to predict how food consumed by a person affects the person’s health. I doubt that one can use such studies for ideology-related goals or that these studies can be affected by a death spiral.
SOTA discourse around AI alignment assumes that the AI can be aligned to nearly every imaginable target, including ensuring that its hosts grab absolute power and hold it forever without even needing to care about others. Which is amoral, but one conjecture in the AI-2027 forecast has Agent-3 develop moral reasoning.
For what it’s worth, I agree that empirical results have made me worry more relative to last year, and it’s part of the reason I no longer have p(doom) below 1-5%.
But there are other important premises which I don’t think are supported well by empirics, and are arguably load-bearing for the confidence that people have.
One useful example from Paul Christiano is there’s a conflation between solving the alignment problem on the first critical try, and not being able to experiment at all, and while this makes AI governance way harder, it doesn’t make the science problem nearly as difficult:
From this list of disagreements
I mostly agree with the rest of your comment.