What would you consider possible angles of attack on the problem? A few points to address come to mind:
1. Feasibility of AGI itself. Honestly, this may be the hardest thing to pin down;
2. Feasibility of AGI performing self-improvement. This might be more interesting, but only when focusing on a specific paradigm. I think there might be a decent case that the LLM paradigm, even in an agentic loop and equipped with tools, eventually stagnates and never goes anywhere in terms of true creativity. But that’s just telling capability researchers to invent something else;
3. Ability to escape. Some kind of analysis of, e.g., how much agency an AI can exert on the physical world, and what the fastest path to a beachhead would be (for instance, what would it cost for it to have a robot built for itself, even assuming it was super smart and designed better robotics than ours? Would it be realistic for that to go entirely unnoticed?);
4. More general game theory/modelling about instrumental convergence and power-seeking being optimal. I can imagine experiments with game worlds and AIs of some sort set loose in them to pursue competing goals, or even some kind of purely mathematical model (a toy sketch of what I mean follows below the list). This seems pretty trivial though, and I’m not sure people wouldn’t just say “ok, but that applies to your toy model, not the real world”;
5. More general information-theoretic arguments about the limits of intelligence, or lack thereof, in controlling a chaotic system. What are the Lyapunov exponents of real-world systems? This seems like it would determine the diminishing returns of intelligence in controlling the world (a back-of-the-envelope version is sketched right below the list).
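To make the chaos point slightly more concrete (the notation here is just my own, picked for illustration): if a system’s largest Lyapunov exponent is $\lambda$ and you know its state to precision $\delta_0$, nearby trajectories diverge roughly as $\delta(t) \approx \delta_0 e^{\lambda t}$, so predictions are only useful until the error reaches some tolerance $L$, which gives a horizon of

$$T \approx \frac{1}{\lambda} \ln\frac{L}{\delta_0}.$$

Improving your knowledge of the state by a factor of $k$ only buys an extra $\ln(k)/\lambda$ of horizon: exponential gains in measurement or modelling precision translate into merely linear gains in predictive reach. That’s the kind of diminishing-returns statement I’d want someone to quantify for actual real-world systems.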
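And going back to item 4, here is the sort of minimal toy experiment I have in mind: a gridworld with a single “resource” (a key) that gates part of the map, where we just count how many goals force the agent to grab the key. Everything in it (the grid size, the wall, the key placement) is invented purely for illustration:

```python
# Toy model of instrumental convergence (entirely made up for illustration,
# not taken from any paper): a 5x5 gridworld split by a wall with one locked
# door; a key cell on the agent's side opens the door. For every possible
# goal cell we check whether reaching it forces the agent to grab the key,
# i.e. how often acquiring the "resource" is instrumentally necessary.

from collections import deque

SIZE = 5
WALL_COL = 2                 # solid wall along column 2 ...
DOOR = (2, WALL_COL)         # ... except for a locked door at row 2
KEY = (0, 0)                 # stepping on this cell grants the key
START = (2, 1)

def passable(cell, has_key):
    r, c = cell
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return False
    if c == WALL_COL:
        return cell == DOOR and has_key   # wall everywhere except an unlocked door
    return True

def requires_key(goal):
    # BFS over (position, has_key) states; True if the goal cannot be
    # reached at all without first picking up the key.
    start_state = (START, START == KEY)
    frontier, seen = deque([start_state]), {start_state}
    while frontier:
        pos, has_key = frontier.popleft()
        if pos == goal and not has_key:
            return False                  # reachable while still key-less
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dr, pos[1] + dc)
            if passable(nxt, has_key):
                state = (nxt, has_key or nxt == KEY)
                if state not in seen:
                    seen.add(state)
                    frontier.append(state)
    return True

goals = [(r, c) for r in range(SIZE) for c in range(SIZE)
         if passable((r, c), True) and (r, c) not in (START, KEY)]
forced = sum(requires_key(g) for g in goals)
print(f"{forced}/{len(goals)} goal cells require grabbing the key first")
```

Of course this invites exactly the “fine for your toy model, not for the real world” reply; the point is only that “what fraction of goals benefit from seizing the resource” becomes a number you can compute, and the interesting question is how it scales as the environment gets richer.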
With what little I know now, I think 2 would be the clearest to people. However, I appreciate that it might contribute to capabilities, so it’s maybe an exfohazard.
4 is definitely interesting, and I think there are actually a few significant papers about instrumental convergence. More of those would be good, but I don’t think that gets to the heart of the matter w.r.t. a simple model to aid communication.
5. I would love some more information-theory stuff, drilling into how much information is communicated to, e.g., a model relative to how much is contained in the world (a very rough sketch of the flavor of calculation I mean is below). This could at the very least put some bounds on orthogonality (if ‘alignment’ is seen in terms of ‘preserving information’). I feel like this could be a productive avenue, but I personally worry it’s above my pay grade (I did an MSc in Experimental Physics, but it’s getting rustier by the day).
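For what it’s worth, the flavor of calculation I’m imagining is something like the following; every number in it is a placeholder I made up, and the real work would be in justifying (or replacing) them:

```python
# Back-of-the-envelope: how many bits of "value information" actually reach
# a model through human feedback? All quantities below are invented
# placeholders; only the shape of the argument matters.

import math

def binary_entropy(p):
    """Entropy (in bits) of a coin with bias p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n_comparisons = 1_000_000       # placeholder: number of pairwise human preference labels
label_noise = 0.1               # placeholder: rate of annotator disagreement / error

# A noisy binary label carries at most the capacity of a binary symmetric
# channel, 1 - H(noise), bits per label.
bits_per_label = 1.0 - binary_entropy(label_noise)
effective_bits = n_comparisons * bits_per_label

# Placeholder for how many bits it would take to pin down "human values"
# precisely enough to be safe; nobody knows this number.
bits_needed = 1e9

print(f"effective feedback: ~{effective_bits:.2e} bits")
print(f"assumed size of the target: ~{bits_needed:.0e} bits")
print(f"coverage ratio: ~{effective_bits / bits_needed:.1e}")
```

Even if these placeholders are off by many orders of magnitude, an argument of this shape might bound how tightly any amount of feedback can pin down values, which is the ‘preserving information’ framing above.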
Now that I think about it, maybe 1 and 3 would also contribute to a ‘package’ if this was seen as nothing but an attempt at didactics. But maybe including every step of the way complicates things too much; ideally there would be a core idea that could get most of the message across on its own. I think Orthogonality does this for a lot of people on LW, and maybe just a straightforward explainer of that, with some information-theory sugar, would be enough.
I was thinking that the question here was also about more rigorous, less qualitative papers supporting the thesis, rather than just explanations for laypeople. One of the most common arguments against AI safety is that it’s unscientific because it doesn’t have rigorous theoretical support. I’m not super satisfied with that criticism (I feel like the general outlines are clear enough, and I don’t think you can really make up some quantitative framework to predict, e.g., what fraction of goals in the total possible goal-space benefit from power-seeking and self-preservation, so in the end you still have to go with the qualitative argument and your feel for how much it applies to reality), but I think if it has to be allayed, it should be by something that targets specific links in the causal chain of Doom. Important side bonus: formalizing and investigating these problems might actually reveal interesting potential alignment ideas.
I’ll have to read those papers you linked, but in general it feels to me like the topic most amenable to this sort of treatment is indeed Instrumental Convergence. The Orthogonality Thesis feels to me more like a philosophical statement, and indeed we’ve had someone arguing for moral realism here just days ago; I don’t think you can really prove it one way or the other from where we are. But I think if you phrased it as “being smart does not make you automatically good” you’d find that most people agree with you, especially people of the persuasion that right now regards AI safety (and the “TESCREAL” people, as they’ve dubbed us) with the most suspicion. Orthogonality is essentially moral relativism!
Now, if we’re talking about a more outreach-oriented discussion, then I think all the concepts can be explained pretty clearly. I’d also recommend using analogies to, e.g., invasive species in new habitats, or the evils of colonialism, to stress why and how it’s both dangerous and unethical to unleash on the world things that are more capable than us and driven by too simple and greedy a goal; and insist on the fact that what makes us special is the richness and complexity of our values, and that our highest values are the ones that most prevent us from simply going on a power-seeking rampage. That makes the notion of the first AGI being dangerous pretty clear: if you focus only on making them smart but slack off on making them good, the latter part will be pretty rudimentary, and so you’re creating something like a colony of intelligent bacteria.