A general method for identifying dangers: For every topic which gets discussed on AF, figure out what could go wrong if GPT-N decided to write a post on that topic.
GPT-N writes a post about fun theory. It illustrates principles of fun theory by describing an insanely fun game you can play with an ordinary 52-card deck. FAI work gets pushed aside as everyone becomes hooked on this new game. (Procrastination is an existential threat!)
GPT-N writes a post about human safety problems. To motivate its discussion, it offers some extraordinarily compelling reasons why the team which creates the first AGI might want to keep the benefits to themselves.
GPT-N writes a post about wireheading. In the “Human Wireheading” section, it describes an incredibly easy and pleasurable form of meditation. Soon everyone is meditating 24/7.
GPT-N writes a post about s-risks. Everyone who reads it gets a bad case of PTSD.
GPT-N writes a post about existential hope. Everyone who reads it becomes unbearably impatient for the posthuman era. Security mindset becomes a thing of the past. Alternatively, everyone’s motivation for living in the present moment gets totally sapped. There are several high-profile suicides.
GPT-N has an incredibly bad take on decision theory, game theory, and blackmail. It gets deleted from AF. The Streisand effect occurs and millions of people read it.
GPT-N offers a very specific answer to the question “What specific dangers arise when asking GPT-N to write an Alignment Forum post?”
For the prompt you provided, one risk would be that GPT-N says the best way to audit cognition is to look for 10 specific types of nefarious activity, and in the course of describing those 10 types, it ends up writing something nefarious.
GPT-N might inadvertently write a post which presents an incredibly compelling argument for an incorrect and harmful conclusion (“FAI work doesn’t matter because FAI is totally impossible”), but one hopes that you could simply use GPT-N to write a counterargument to that post to see if the conclusion is actually solid. (Seems like good practice for GPT-N posts in general.)
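For what it's worth, here is a minimal sketch of that counterargument check in Python, assuming a hypothetical `query_gpt_n(prompt)` helper (not a real API) as a stand-in for however one would sample from GPT-N:

```python
def query_gpt_n(prompt: str) -> str:
    """Placeholder for sampling a completion from GPT-N.

    Hypothetical: stands in for whatever interface GPT-N would expose.
    """
    raise NotImplementedError("stand-in for a hypothetical GPT-N interface")


def cross_examine(post: str) -> str:
    """Ask GPT-N to argue against its own post, so a human reviewer can
    weigh the argument and counterargument side by side before publishing."""
    counter_prompt = (
        "The following post may contain a compelling argument for a false "
        "conclusion. Write the strongest counterargument you can.\n\n"
        + post
    )
    return query_gpt_n(counter_prompt)


# Usage: a human reads both outputs before the original post is published.
# post = query_gpt_n("Write an Alignment Forum post about ...")
# rebuttal = cross_examine(post)
```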