I’m starting an internal red-teaming project at Forethought against Forethought’s AI propensity targeting/model character work. The theory of change for internal red-teaming is that someone (me!) spending dedicated, focused time on the negative case can find and elaborate on holes that Forethought’s existing feedback processes will miss.
Possible good outcomes from the red-teaming project (roughly in decreasing order of expected impact):
- I identify and come up with good reasons to wind down Forethought’s work in this field iff it is correct to do so.
  - Note that “identify” in this case may involve original research, but it may instead be mostly about sourcing the ideas from other people.
- I identify ways the current direction of propensity targeting work is bad and suggest significant directional changes.
- People don’t change high-level direction, but I come to understand the counterarguments to this work well enough to make them legible within Forethought, so people are aware of the counterarguments and make fewer mistakes in their research.
- I change my own mind significantly on the case for this work and decide to double down on it myself.
- I identify sufficiently concrete cruxes that we can use to decide whether to prioritize this work in the future (e.g., stop/wind down this work if XYZ happens).
I think I’d find it very helpful to understand why people are skeptical of AI propensity targeting work, either in general or as done by Forethought in particular.
For me, the most useful critiques will elaborate on why the work is net-negative (ideally significantly) or why it’s almost certain to be useless, though other critiques are welcome as well.
I’d really appreciate comments from LessWrong people and others! Especially helpful are links to longer write-ups/existing comments. I’d also be happy to chat on a call if anyone’s interested.
I’ve already read comments here, here, and various critical writings on Anthropic’s Constitution and “soul doc.”