Here is my honest reaction as another data point. (Well done to the parent commenter for taking the initiative!)
Context: Got introduced to this field around a year ago. Not an expert.
My honest reaction is rather worried as well (to put it mildly).
1. I agree with this. My impression is that on many tasks current systems require a lot more data than humans do, but I do not see any reason to expect that it will always be so.
2. I broadly agree with this. I am sympathetic to people who would like to see more concrete stories about how exactly an AGI would take over the world (there are some already, but more wouldn’t hurt). Meanwhile:
- I believe that if effort were put into inventing such takeover scenarios, one should expect to come up with quite a few of them. Hence, one can update already.
- I haven’t looked into nanobots myself, so no inside view there, but my prior is definitely on “there are lots of (causally) powerful technologies we haven’t invented yet”.
- The AI box experiment really feels like strong empirical evidence for the bootstrapping argument.
3. I agree with this as stated. I do wonder, though, whether we will get any warning shots, where we operate at a semi-dangerous level and fail. This seems to reduce to slow vs. fast takeoff. (I don’t have a consistent opinion on that.)
4. Agree that there is a time limit. And indeed, recognition of the issue and cooperation from the relevant actors seem non-ideal.
5. Agree.
6. I’m not sure here—I agree that we should avoid the situation where we have multiple AGIs. If “pivotal act” is defined as an act which results in this outcome, then there is agreement, but as someone pointed out, it might be that the pivotal act is something which doesn’t fit the mental picture one associates with the words “pivotal act”.
7. I notice I am confused here: I’m not sure what “pivotal weak act” means, or what “something weak enough with an AGI to be *passively safe*” means. I agree with “no one knows of any pivotal act you could do with just current SOTA AI”. I don’t have good intuitions about the space of pivotal actions—I haven’t thought about it.
8. I interpret “problems we want an AI to solve” as meaning problems relevant for pivotal acts. In that case, see above—I don’t have intuitions about pivotal acts.
9. See above.
10. Broadly agree.
11. Again, I don’t know much about pivotal acts. (It is mentioned that “Pivotal weak acts like this aren’t known, and not for want of people looking for them.”—have I missed some big projects on pivotal acts?)
12. Agree.
13. Agree.
14. Agree. The discontinuity / “treacherous turn” seems obvious to me when thought about from first principles. The skeptic voice in my head says that nothing like that has happened in practice (to my knowledge), but that really does not reassure me.
15. Broadly agree, though I lack good examples of the concept “alignment-required invariants”. My best guess: there is interpretability research on neural networks, and we have some non-trivial understanding there, but that might turn out not to be relevant if a great new capabilities idea comes along.
16. I agree that the concept of inner alignment is important. There is empirical evidence for it. I am unsure about how big of a problem this will be in practice. I do appreciate the point about evolution.
17. I like this formulation (quite crisp); I don’t think I’ve seen it anywhere before. It seems like an interesting problem to try to come up with ways of getting inner properties into systems.
18. Agree.
19. Agree.
20. Agree, except I don’t understand what “If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.” means.
21. Not sure I get the central point, but I do get the idea that capabilities get feedback from reality in a way that utility functions do not.
22. Agree.
23. A good, crisp formulation. Agree.
24. A good distinction, sure. In other words “let the AGI optimize something, no strings attached (and choose that “something” very carefully)” vs. “try to control/restrict the AGI”. I’m wondering whether there are any alternatives.
25. “We’ve got no idea” seems to me like a bit of an exaggeration, but I agree with the latter sentence.
26. Yep.
27. Yep, an instance of Goodhart’s law.
28. Yep.
29. I agree that this is the generic case: if you take a complex action sequence of an AGI at random, it is almost surely uninterpretable by humans. I am not sure what would happen if you optimized for plans whose consequences humans are confident they understand. Sure, we have to fight against Goodhart’s law, and I do think that against sufficiently powerful cognitive systems our chances would be slim, but I’m not sure that one couldn’t extract enough information to perform a pivotal act. Failure at AI boxing does seem like a major bottleneck, though.
30. I agree up to “it knows … that some action sequence results in the world we want”. I also agree that if we knew in advance exactly how an AI would behave, it would be less intelligent than a human. But I feel there is a gap in moving from that to the claim that there is *no* humanly checkable pivotal output of an AGI. If I am stuck in a maze and build an AGI to help me find the way out, I cannot anticipate what exact path it will give me, but I can check whether the path leads out or not (see the toy sketch after this list). So I think the general claim “there is no pivotal output … that is humanly checkable” is not properly justified here. I do feel like this would be the generic case, though, namely that the AGI could convince us of a plan and sneak in unintended consequences.
31. Agree. Seems conceptually related to 17: 17 is about affecting the inner properties of the system, 31 is about inspecting the inner properties.
32. An interesting point I haven’t seen elsewhere, namely “Words are not an AGI-complete data representation in its native style”. Not sure if it makes sense to give the claim a true/false status, but it pushes me a non-zero amount in the direction of “alignment is hard”.
33. Agree. This is a statement which I could see many educated people nodding at, but which at least I find quite hard to feel on a gut level. (The Sequences contain helpful material on this, and apparently reading the right science fiction books would also help.)
34. Agree.
35. Agree. I guess there is also the scenario where one AGI has a decisive advantage over the other, but the outcome is the same: you cannot keep the AGIs in line by pitting them against each other.
36. Agree with the bolded part, the AI-box experiment is more than enough evidence for this.
37. Agree with “in the case of AGI safety, it is really important to have conservation of expected evidence about the difficulty of alignment”.
38. It does seem to me that “AGI safety” is quite a small subfield of “AI safety”, or you can see them as separate fields. I agree that the incentives are not in our/humanity’s favor.
39. I like this paragraph. I could nitpick about how the point of community building is that not everyone has to figure things out from the null string, but on the other hand I understand the view expressed here.
40. I have no clear view about how different the skills required for alignment are in contrast to more usual cognitively demanding work (other than that it is, well, hard). (I realize that I am biased—I found myself agreeing with “AGI risk is real” without much friction, but there are definitely many people who do not come to this conclusion.)
41. No comment.
42. I associate “There’s no plan” with the field being in a preparadigmatic state. I agree that it would be much preferable if this weren’t the state of affairs, so that we could be in a position to design a plan.
43. This part hit home: “not an uncomfortable shrug and ‘How can you be sure that will happen’ / ‘There’s no way you could be sure of that now, we’ll have to wait on experimental evidence.’” I am sad that the Standard Response to AGI risk is “AI won’t be intelligent enough to do that”. (Not to say that there aren’t stronger counterarguments).
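To make the maze intuition from point 30 concrete, here is a minimal sketch (my own toy illustration, not something from the original post): even if I cannot predict which path a stronger planner will output, checking a proposed path against the maze is cheap and does not require trusting the planner. The maze, the `check_path` function, and the proposed paths below are all hypothetical placeholders.

```python
# Toy sketch (hypothetical): verifying a proposed maze path is easy even when
# finding one is hard. The untrusted "planner" only hands us a path; we check it.

def check_path(maze, start, goal, path):
    """Return True iff `path` is a legal walk from `start` to `goal`.

    `maze` is a list of strings where '#' marks a wall; `path` is a string of
    moves from {'U', 'D', 'L', 'R'}. Verification is O(len(path)), independent
    of how hard it was to find the path in the first place.
    """
    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    r, c = start
    for step in path:
        dr, dc = moves[step]
        r, c = r + dr, c + dc
        # Reject the path as soon as it leaves the grid or hits a wall.
        if not (0 <= r < len(maze) and 0 <= c < len(maze[r])) or maze[r][c] == "#":
            return False
    return (r, c) == goal


maze = [
    "#####",
    "#   #",
    "# # #",
    "#   #",
    "#####",
]
# Paths proposed by some untrusted planner; we never need to know how they were found.
print(check_path(maze, start=(1, 1), goal=(3, 3), path="RRDD"))  # True
print(check_path(maze, start=(1, 1), goal=(3, 3), path="RD"))    # False (walks into a wall)
```

Of course, the disanalogy is exactly what point 30 is about: for real pivotal outputs there may be no analogue of `check_path` that humans can run.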
This is another reply in this vein. I’m quite new to this, so don’t feel obliged to read through; I just told myself I would publish this.
I agree (90-99% agreement) with almost all of the points Eliezer made. And the rest is where I probably didn’t understand enough or where there’s no need for a comment, e.g.:
1. − 8. agree
9. Not sure if I understand this right: if the AGI has been successfully designed not to kill everyone, then why is oversight needed? And if it is capable of doing so and the design fails, what would our oversight do? I don’t think this is like the nuclear cores. It feels like a bomb you are pretty sure won’t go off at random, but if it does, your oversight won’t stop it.
10. − 14. - agree
15. - I feel like I need to think about it more to honestly agree.
16. − 18. - agree
19. - to my knowledge, yes
20. − 23. - agree
24. - initially I put “80% agree” on the first part of the argument here, namely that
The complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI
but then, discussing it with my reading group, I went over it a few times and began to agree even more as I grasped the complexity of something like CEV.
25. − 29. - agree
30. - agree, although I wasn’t sure about
an AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain
I think the key part of this claim is “all the effects of”, and I wasn’t sure whether we have to understand all of them. But of course we have to be sure that one of the effects is not human extinction, so yes, and for “solving alignment” also yes.
31. − 34. - agree
35. - no comment, I have to come back to this once I grasp LDT better
36. - agree
37. - no comment, seems like a rant 😅
38. - agree
39. - ok, I guess
40. - agree; I’m glad some people want to experiment with the financing of research on this.
41. - agree, although I also agree with some of the top comments on this, e.g. evhub’s
42. - agree
43. - agree, at least this is what it feels like
Regarding 9: I believe it’s the case where you are successful enough that your AGI doesn’t instantly kill you, but it still can kill you in the process of using it. It’s in the context of a pivotal act, so it assumes you will operate it to do something significant and potentially dangerous.