AGI Safety from first principles doesn’t meet all of Tyler’s requirements and desiderata, but I think it’s a good introduction for a technical but skeptical / uninitiated audience.
… in a leading science journal, refereed of course.
Being published and upvoted / endorsed by peers on the Alignment Forum should arguably count as this. Highly-upvoted content on the Alignment Forum is often higher in quality than papers published in even the best traditional academic journals.
I think the question of where and how to conduct science and review is a separate question from the question of why to care about AI risk though. I am happy to read content published in any venue, though I might be hesitant to spend much time diving into any particular piece, unless it has been endorsed by someone I trust on Twitter, LW, AF, etc. Publication in an academic journal is just a different kind of endorsement, one that, in my experience, is a weaker and less reliable indicator of quality than the others.
That’s a good paper, but I think it exemplifies the problem outlined by Cowen: it mostly contains references to Bostrom and Yudkowsky, and it doesn’t really touch on the more technical work that exists (Yampolskiy, Schmidhuber), which makes me think that it isn’t a very thorough review of the field. It seems like more of the same. Maybe the Hubinger paper referenced therein is on the right track?
The question of where to do science is relevant but not important. Cowen even mentions that ‘if it doesn’t get published, just post it online’, so he is not against reading forums.
It really looks like there could be enough stuff out there to make a model. Which makes me think the scepticism is even more justified! Because if it looks like a duck and talks like a duck but doesn’t float like a duck, maybe it’s a lump of stone?
What would you consider as possible angles of attack to the problem? A few points to address that come to mind:
1. feasibility of AGI itself. Honestly may be the hardest thing to pinpoint;
2. feasibility of AGI performing self-improvement. This might be more interesting but only focusing on a specific paradigm. I think there might be a decent case for suggesting the LLM paradigm, even in an agentic loop and equipped with tools, eventually stagnates and never goes anywhere in terms of true creativity. But that’s just telling capability researchers to invent something else;
3. ability to escape. Some kind of analysis of e.g. how much agency can an AI exert on the physical world, what’s the fastest path to having a beachhead (such as, what would be the costs of it having a robot built for itself, even assuming it was super smart and designed better robotics than ours? Would it be realistic for it to go entirely unnoticed?)
4. more general game theory/modelling about instrumental convergence and power seeking being optimal. I can think of experiments with game worlds and AIs of some sort set into it to try and achieve competing goals, or even some kind of purely mathematical model. This seems pretty trivial though, and I’m not sure people wouldn’t just say “ok but it applies to your toy model, not the real world”
5. more general information theoretical arguments about the limits of intelligence, or lack thereof, in controlling a chaotic system? What are the Lyapunov exponents of real world systems? Seems like this would affect the diminishing returns of intelligence in controlling the world.
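To make the last point a bit more concrete, here is a minimal sketch of the logarithmic prediction horizon that a positive Lyapunov exponent implies. It uses the logistic map at r = 4 purely as a stand-in chaotic system (an assumption for illustration, not a claim about any real-world system), and the function names `lyapunov_logistic` and `prediction_horizon` are hypothetical. The idea is that if small errors grow roughly like δ₀·exp(λt), then the time until they exceed a tolerance is about ln(tolerance/δ₀)/λ, so better measurement only buys horizon logarithmically.

```python
import math


def lyapunov_logistic(r=4.0, x0=0.3, burn_in=1_000, steps=50_000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x).

    The estimate is the orbit average of log|f'(x)|, with f'(x) = r*(1 - 2x).
    The logistic map is only an illustrative stand-in for "a chaotic system".
    """
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-300))  # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps


def prediction_horizon(delta0, tolerance, lam):
    """Steps until an initial error delta0 grows to tolerance, assuming exp(lam*t) growth."""
    return math.log(tolerance / delta0) / lam


lam = lyapunov_logistic()
coarse = prediction_horizon(1e-6, 1e-1, lam)  # initial state known to 1e-6
fine = prediction_horizon(1e-9, 1e-1, lam)    # a thousand times more precise
print(f"lambda ~ {lam:.3f}")
print(f"horizon at 1e-6: {coarse:.1f} steps, at 1e-9: {fine:.1f} steps")
```

For this particular map the exponent is known analytically to be ln 2, so a thousandfold improvement in initial precision adds only about ln(1000)/ln 2 ≈ 10 iterations of useful prediction. Whether real-world systems relevant to "controlling the world" behave anything like this is exactly the open question.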
With what little I know now, I think 2 would be the clearest to people. However, I appreciate that it might contribute to capabilities, so maybe it’s an exfohazard.
4 is definitely interesting, and I think there are actually a few significant papers about instrumental convergence. More of those would be good, but I don’t think that gets to the heart of the matter w.r.t. a simple model to aid communication.
5. I would love some more information theory stuff, drilling into how much information is communicated to e.g. a model relative to how much is contained in the world. This could at the very least put some bounds on orthogonality (if ‘alignment’ is seen in terms of ‘preserving information’). I feel like this could be a productive avenue, but personally worry it’s above my pay grade (I did an MSc in Experimental Physics but it’s getting rustier by the day).
Now that I think about it, maybe 1 and 3 would also contribute to a ‘package’ if this was seen as nothing but an attempt at didactics. But maybe including every step of the way complicates things too much; ideally there would be a core idea that could get most of the message across on its own. I think Orthogonality does this for a lot of people in LW, and maybe just a straightforward explainer of that with some information-theory sugar would be enough.
I was thinking more that the question here was also about more rigorous and less qualitative papers supporting the thesis, rather than just explanations for laypeople. One of the most common arguments against AI safety is that it’s unscientific because it doesn’t have rigorous theoretical support. I’m not super satisfied with that criticism (I feel like the general outlines are clear enough, and I don’t think you can really make up some quantitative framework to predict, e.g., which fraction of goals in the total possible goal-space benefit from power-seeking and self-preservation, so in the end you still have to go with the qualitative argument and your feel for how much it applies to reality), but I think if it has to be allayed, it should be by something that targets specific links in the causal chain of Doom. An important side bonus: formalizing and investigating these problems might actually reveal interesting potential alignment ideas.
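For a sense of what counting "the fraction of goals that benefit from keeping options open" looks like, here is a minimal sketch in a deliberately trivial toy world. Everything in it is an assumption made up for illustration: the two-branch world, the i.i.d. uniform rewards standing in for "goal-space", and the function name `fraction_preferring_hub`. It says nothing about how goals are distributed in reality, which is the hard part.

```python
import random


def fraction_preferring_hub(n_hub_options=3, n_samples=100_000, seed=0):
    """Share of random goals for which the option-preserving move is strictly optimal.

    Toy world (hypothetical): from the start state the agent either commits to
    a single dead-end outcome, or moves to a hub from which it can choose any
    one of n_hub_options outcomes. A "goal" is an independent uniform(0, 1)
    reward assigned to each outcome.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        dead_end_reward = rng.random()
        best_hub_reward = max(rng.random() for _ in range(n_hub_options))
        if best_hub_reward > dead_end_reward:
            wins += 1
    return wins / n_samples


for k in (1, 3, 9):
    print(k, fraction_preferring_hub(n_hub_options=k))
```

Under these assumptions the hub is preferred with probability k/(k+1), since the best of k+1 i.i.d. rewards is equally likely to be any one of them. The number is easy to get in the toy and hard to defend outside it, which is the limitation described above.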
I’ll have to read those papers you linked, but to me in general it feels like perhaps the topic more amenable to this sort of treatment is indeed Instrumental Convergence. The Orthogonality Thesis feels to me more of a philosophical statement, and indeed we’ve had someone arguing for moral realism here just days ago. I don’t think you can really prove it or not from where we are. But I think if you phrased it as “being smart does not make you automatically good” you’d find that most people agree with you, especially people of the persuasion that right now regards AI safety and “TESCREAL” people, as they dubbed us, with the most suspicion. Orthogonality is essentially moral relativism!
Now if we’re talking about a more outreach-oriented discussion, then I think all concepts can be explained pretty clearly. I’d also recommend using analogies to e.g. invasive species in new habitats, or the evils of colonialism, to stress why and how it’s both dangerous and unethical to unleash on the world things that are more capable than us and are driven by too simple and greedy a goal; insist on the fact that what makes us special is the richness and complexity of our values, and that our highest values are the ones that most prevent us from simply going on a power-seeking rampage. That makes the notion of the first AGI being dangerous pretty clear: if you focus only on making them smart but you slack off on making them good, the latter part will be pretty rudimentary, and so you’re creating something that is like a colony of intelligent bacteria.