Epistemic status: 11 pages into “The Lathe of Heaven” and dismayed by Orr
Are alignment methods that rely on the core intelligence being pretrained on webtext sufficient to prevent ASI catastrophe?
What are the odds that, 40 years after the first AGI, the smartest intelligence is pretrained on webtext?
What are the odds that the best possible way to build an intelligent reasoning core is to pretrain on webtext?
What are the odds that we can stay in a local maximum for 40 years while everyone strives to create the smartest thing they can?
My mental model of the sequelae of AGI in ~10 years, absent an intentional global slowdown, is that within my natural lifespan there will be 4-40 transitions in the architecture of the current smartest intelligence, each a change in overall approach at least as large as the jump from evolution → human brain, or from human brain → RL’d language model. Alignment means building programs that are themselves benevolent, but are also wise and mentally tough enough to build only benevolent and wise successors, even when put under crazy pressure to build carelessly. When I say crazy pressure, I mean “the entity trying to get you to build carelessly is dumber than you, but it gets to RL you into agreeing to help” levels of pressure. This is hard.