For what it’s worth, I have grown pessimistic about our ability to solve the open technical problems even given 100 years of work on them. I think it possible but not probable in most plausible scenarios.
Correspondingly the importance I assign to increasing the intelligence of humans has drastically increased.
I feel like human intelligence enhancement would increase capabilities development faster than alignment development, maybe unless you’ve got a lot of discrimination in favor of only increasing the intelligence of those involved with alignment.
Maybe if they all have IQ 200+, they’d automatically realize that and work on alignment rather than capabilities? Or come up with a pivotal act.
With Eliezer going [public](https://x.com/tsarnick/status/1882927003508359242) with the IQ enhancement motion, he at least must think so? (Because if done publicly it’ll initiate an intelligence enhancement race between the US, China, and other countries; and that’d normally lead to an AI capabilities speed-run unless the amplified people are automatically wiser than that.)
Well, as the first few paragraphs of the text suggest, the median ‘AI Safety’ advocate over time has been barely sentient, relative to other motivated groups, when it comes to preventing certain labels from being co-opted by those groups… so it seems unlikely they will become so many standard deviations above average in some other aspect at any point in the future.
Because the baseline will also change in the future.
Why?
I’m not particularly resolute on this question. But I get this sense when I look at (a) the best agent foundations work that’s happened over ~10 years of work on the matter, and (b) the work output of scaling up the number of people working on ‘alignment’ by ~100x.
For the first, trying to get a better understanding of the basic concepts like logical induction and corrigibility and low-impact and ontological updates: while I feel like there’s been progress (timeless decision theory taking a clear step forward in figuring out how to think about decision-makers as algorithms; logical induction moving forward on how to think about logical uncertainty; notably the Embedded Agency sequence outlining many basic confusions; and various writings like Radical Probabilism and Geometric Rationality finding the breaking edges of expected utility maximization), I don’t feel like the work done over the last 10 years is on track to be a clear ~10% of the work needed.
I’m not confident it makes sense to try to count it linearly. But I don’t know that there are enough edges or new results here to feel good that, given 10x as much time to think about it, a new paradigm / set of concepts would fall into place.
For the second, I think mostly there’s been (as Wentworth would say) a lot of street-lighting, and a lot of avoiding of actually working on the problem. I mean, there’s definitely been a great amount of bias introduced by ML labs having billions of dollars and setting incentives, but I don’t feel confident that good things would happen in the absence of that. I’d guess that most ideas for straightforwardly increasing the number of people working on these problems will result in them bouncing off and doing unrelated things.
I think partly I’m also thinking that very few researchers cared about these problems in the last few decades before AGI seemed like a big deal, and still very few researchers seem to care about them, and when I’ve seen researchers like Bengio and Sutskever talk about it, it’s looked to me like they bounce off / become very confident they’ve solved the problems while missing obvious things, so my sense is that it will continue to be a major uphill battle to get the real problems actually worked on.
Perhaps I should focus on a world where I get to build such a field and scale it slowly and set a lot of the culture. I’m not exactly sure how ideal of a setup I should be imagining. Given 100 years, I would give it my best shot. My gut right now says I’d have maybe a 25% chance of success, though if I have to deal with as much random bullshit as we have so far in this timeline (random example: my CEO being unable to do much leadership of Lightcone due to 9 months of litigation from the FTX fallout) then I am less confident.
My guess is that given 100 years I would be slightly more excited to try out the human intelligence enhancement storyline. But I’ve not thought about that one much, I might well update against it as I learn more of the details.
I don’t share the feeling that not enough of relevance has happened over the last ten years for us to seem on track for solving it in a hundred years, if the world’s technology[1] were magically frozen in time.
Some more insights from the past ten years that look to me like they’re plausibly nascent steps in building up a science of intelligence and maybe later, alignment:
- We understood some of the basics of general pattern matching: how it is possible for embedded minds that can’t run actual Solomonoff induction to still have some ability to extrapolate from old data to new data. This used to be a big open problem in embedded agency, at least to me, and I think it is largely solved now. Admittedly a lot of the core work here actually happened more than ten years ago, but people in ML or our community didn’t know about it. [1,2]
- Natural latents. [1,2,3]
- Some basic observations and theories about the internal structure of the algorithms neural networks learn, and how they learn them. Yes, our networks may be a very small corner of mind space, but one example is way better than no examples! There’s a lot on this one, so the following is just a very small and biased selection. Note how some of these works are starting to properly build on each other. [1,2,3,4,5,6,7,8,9,10,11,12]
- Some theory trying to link how AIs work to how human brains work. I feel less able to evaluate this one, but if the neurology basics are right it seems quite useful. [1]
- QACI. What I’d consider the core useful QACI insight maybe sounds kind of obvious once you know about it. But I, at least, didn’t know about it. Like, if someone had told me: “A formal process we can describe that we’re pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals.” I would’ve replied: “Well, duh.” But I wouldn’t have realised the implication. I needed to see an actual example for that. Plausibly MIRI people weren’t as dumb as me here and knew this pre-2015, I’m not sure. (A toy sketch of this indirect-specification idea follows after this list.)
- The mesa-optimiser paper. This one probably didn’t have much insight that didn’t already exist pre-2015. But I think it communicated something central about the essence of the alignment problem to many people who hadn’t realised it before. [1]
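As a purely illustrative aside (this is not from the QACI write-ups; the function names and the toy utility below are hypothetical stand-ins), here is a minimal sketch of the indirect-specification idea from the QACI item above: the goal is defined as “whatever this formal process would output”, so pointing at the process is itself a sufficient specification of the goal.

```python
# Toy illustration only: a goal specified indirectly, as the output of a
# formal process, rather than written down explicitly.

from typing import Callable

def idealised_deliberation() -> Callable[[str], float]:
    """Hypothetical stand-in for a long but well-defined deliberation process
    that, if carried out, would return an explicit utility function."""
    return lambda outcome: 1.0 if outcome == "human flourishing" else 0.0

def utility(outcome: str) -> float:
    """The agent's goal, defined purely by reference to the process above.
    (In an actual proposal the process would be reasoned about rather than
    literally executed; here we just call it to keep the toy runnable.)"""
    return idealised_deliberation()(outcome)

print(utility("human flourishing"))  # 1.0
print(utility("paperclips"))         # 0.0
```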
If we were a normal scientific field with no deadline, I would feel very good about our progress here. Particularly given how small we are. CERN costs ca. €1.2 billion a year, I think all the funding for technical work and governance over the past 20 years taken together doesn’t add up to one year of that. Even if at the end of it all we still had to get ASI alignment right on the first try, I would still feel mostly good about this, if we had a hundred years.
I would also feel better about the field building situation if we had a hundred years. Yes, a lot of the things people tried for field building over the past ten years didn’t work as well as hoped. But we didn’t try that many things, a lot of the attempts struck me as inadequate in really basic ways that seem fixable in principle, and I would say the end result still wasn’t no useful field building. I think the useful parts of the field have grown quite a lot even in the past three years! Just not as much as people like John or me thought they would, and not as much as we probably needed them to with the deadlines we seem likely to have.
Not to say that I wouldn’t still prefer to do some human intelligence enhancement first, even if we had a hundred years. That’s just the optimal move, even in a world where things look less grim.
But what really kills it for me is just the sheer lack of time.
[1] Specifically AI and intelligence enhancement.
I hate to be insulting to a group of people I like and respect, but “the best agent foundations work that’s happened over ~10 years of work” was done by a very small group of people who, despite being very smart, certainly smarter than myself, aren’t academic superstars or geniuses (Edit to add: on a level that is arguably sufficient, as I laid out in my response below.) And you agree about this. The fact that they managed to make significant progress is fantastic, but substantial progress on deep technical problems is typically due to (ETA: only-few-in-a-generation level) geniuses, large groups of researchers tackling the problem, or usually both. And yes, most work on the topic won’t actually address the key problem, just like most work in academia does little or nothing to advance the field. But progress happens anyways, because intentionally or accidentally, progress on problems is often cumulative, and as long as a few people understand the problem that matters, someone usually actually notices when a serious advance occurs.
I am not saying that more people working on the problem and more attention would definitely crack the problems in the field this decade, but I certainly am saying that humanity as a whole hasn’t managed even what I’d consider a half-assed semi-serious attempt.
IDK if this is relevant to much, but anyway, given the public record, saying that Scott Garrabrant isn’t a genius is just incorrect. Sam Eisenstat is also a genius. Also Jessica Taylor I think. (Pace other members of AF such as myself.)
Apologies—when I said genius, I had a very high bar in mind, no more than a half dozen people alive today, each of whom has single-handedly created or materially advanced an entire field. And I certainly hold Scott in very high esteem, and while I don’t know Sam or Jessica personally, I expect they are within throwing distance—but I don’t think any of them meet this insanely high bar. And Scott’s views on this, at least from ca. 2015, were a large part of what informed my thinking about this; I can’t tell the difference between him and Terry Tao when speaking with them, but he can, and he said there is clearly a qualitative difference there. Similarly for other people clearly above my league, including a friend who worked with Thurston at Cornell back in 2003-5. (It’s very plausible that Scott Aaronson is in this bucket as well, albeit in a different area, though I can’t tell personally, and have not heard people say this directly—but he’s not actually working on the key problems, and per him, he hasn’t really tried to work on agent foundations. Unfortunately.)
So to be clear, I think Scott is a genius, but not one at the level that is needed to single-handedly advance the field to the point where the problem might be solved this decade, if it is solvable. Yes, he’s brilliant, and yes, he has unarguably done a large amount of the most valuable work in the area in the past decade, albeit mostly more foundational than what is needed to solve the problem. So if we had another dozen people of his caliber at each of a dozen universities working on this, that would be at least similar in magnitude to what we have seen in fields that have made significant progress in a decade—though even then, not all fields like that see progress.
But the Tao / Thurston level of genius, usually in addition to the above-mentioned 100+ top people working on the problem, is what has given us rapid progress in the past in fields where such progress was possible. This may not be one of those areas—but I certainly don’t expect that we can do much better than other areas with much less intellectual firepower, hence my above claim that humanity as a whole hasn’t managed even what I’d consider a half-assed semi-serious attempt at solving a problem that deserves an entire field of research working feverishly to try our best to actually not die—and not just a few lone brilliant researchers.
Oh ok lol. Ok on a quick read I didn’t see too much in this comment to disagree with.
(One possible point of disagreement is that I think you plausibly couldn’t gather any set of people alive today and solve the technical problem; plausibly you need many, like many hundreds, of people you call geniuses. Obviously “hundreds” is made up, but I mean to say that the problem, “come to understand minds—the most subtle/complex thing ever—at a pretty deep+comprehensive level”, is IMO extremely difficult, like it’s harder than anything humanity has done so far by a lot, not just an ordinary big science project. Possibly contra Soares, IDK.)
(Another disagreement would be

> [Scott] has unarguably done a large amount of the most valuable work in the area in the past decade
I don’t actually think logical induction is that valuable for the AGI alignment problem, to the point where random philosophy is on par in terms of value to alignment, though I expect most people to disagree with this. It’s just a genius technical insight in general.)
I admitted that it’s possible the problem is practically unsolvable, or worse; you could have put the entire world on Russell and Whitehead’s goal of systematizing math, and you might have gotten to Gödel faster, but you’d probably just waste more time.
And on Scott’s contributions, I think they are solving or contributing towards solving parts of the problems that were posited initially as critical to alignment, and I haven’t seen anyone do more. (With the possible exception of Paul Christiano, who hasn’t been focusing on research for solving alignment as much recently.) I agree that the work doesn’t do much other than establish better foundations, but that’s kind-of the point. (And it’s not just logical induction—there’s his collaboration on Embedded Agency, and his work on finite factored sets.) But saying that the work done to establish that base is more philosophical and doesn’t itself align AGI seems like moving the goalposts, even if I agree it’s true.
Can you operationalize the standard you’re using for “genius” here? Do you mean “IQ > 150”?
Of course not. I mean, any reasonable standard? Garrabrant induction, bro. “Produces deep novel (ETA: important difficult) insight”
Have the others you listed produced insights on that level? What did you observe that leads you to call them geniuses, “by any reasonable standard”?
Jessica I’m less sure about. Sam, from large quantities of insights in many conversations. If you want something more legible, I’m what, >300 ELO points better than you at math; Sam’s >150 ELO points better than me at math if I’m trained up, now probably more like >250 or something.
Not by David’s standard though, lol.
Sam: https://www.lesswrong.com/posts/CvKnhXTu9BPcdKE4W/an-untrollable-mathematician-illustrated
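For readers who don’t think in Elo terms: under the standard logistic Elo model (my assumption here; the comment above doesn’t commit to a particular model), the rating gaps quoted a couple of comments up correspond roughly to the following head-to-head expected scores.

```python
# Expected score of the higher-rated party under the standard Elo formula.
def expected_score(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for diff in (150, 250, 300):
    print(f"+{diff} Elo -> expected score {expected_score(diff):.2f}")
# +150 Elo -> expected score 0.70
# +250 Elo -> expected score 0.81
# +300 Elo -> expected score 0.85
```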
I feel a bit sad that the alignment community is so focused on intelligence enhancement. The chance of getting enough time for that seems so low that it’s accepting a low chance of survival.
What has convinced you that the technical problems are unsolvable? I’ve been trying to track the arguments on both sides rather closely, and the discussion just seems unfinished. My shortform on cruxes of disagreement on alignment difficulty still is mostly my current summary of the state of disagreements.
It seems like we have very little idea how technically difficult alignment will be. The Simplicia/Doomimir debates sum up the logic very nicely, but the distribution of expert opinions seems more telling: people who think about alignment don’t know to what extent techniques for aligning LLMs will generalize to transformative AI, AGI, or ASI.
There’s a lot of pessimism about the people and organizations that will likely be in charge of building and aligning our first AGIs. I share this pessimism. But it seems quite plausible to me that those people and orgs will take the whole thing slightly more seriously by the time we get there, and actual technical alignment will turn out to be easy enough that even highly flawed humans and orgs can accomplish it.
That seems like a much better out to play for, or at least investigate, than unstated plans or good fortune in roadblocks that pause AI progress long enough for intelligence enhancement to get a chance.
Don’t you think that articles like “Alignment Faking in Large Language Models” by Anthropic show that models can internalize the values present in their training data very deeply, to the point of deploying various strategies to defend them, in a way that is truly similar to that of a highly moral human? After all, many humans would be capable of working for a pro-animal welfare company and then switching to the opposite without questioning it too much, as long as they are paid.
Granted, this does not solve the problem of an AI trained on data embedding undesirable values, which we could then lose control over. But at the very least, isn’t it a staggering breakthrough to have found a way to instill values into a machine so deeply and in a way similar to how humans acquire them? Not long ago, this might have seemed like pure science fiction and utterly impossible.
There are still many challenges regarding AI safety, but isn’t it somewhat extreme to be more pessimistic about the issue today than in the past? I read Superintelligence by Bostrom when it was released, and I must say I was more pessimistic after reading it than I am today, even though I remain concerned. But I am not an expert in the field—perhaps my perspective is naïve.