Has anyone written an essay about how to fight against/correct for Trapped Priors? I would like to do something like that, but I want to make sure that I’m not reinventing the wheel here. Thank you!
I keep running into conceptual confusion around the term “alignment,” particularly when reading older Less Wrong posts. Some people say “aligned AI” and mean “an AI that works for human flourishing,” some people say that an AI “is aligned” if it reliably advances the intended objectives of some person or group (and doesn’t have a secret set of goals / isn’t scheming), and still others use “alignment” to mean something like “the ability of any system to reliably work towards some pre-defined goal.” I usually have to work out which meaning is intended on the spot, which is annoying given that the implications of each are very different.
Is there one commonly accepted definition? Is this confusion just a thing we’ve all accepted?
As Raemon put it,
You need to successfully point the AI at anything at all. (This may superficially seem like it’s working with current LLMs, but it isn’t actually anywhere close to robust enough to hold up)
You need to point the AI at some kind of nuanced abstract target, in particular, that remains stable as the AI updates its ontology.
(You also eventually need to point the AI at a cluster of messy human-value-concepts in particular. Though from what I gather, MIRI-ish people think if you get the first two things, this last part isn’t actually that hard)
An aligned AI is one that humans have successfully pointed at a goal. If humanity does solve alignment, then a power struggle over which goals the AI serves could shape the world. Otherwise, the AI pursues goals that humanity never set, and humans are wiped out or disempowered.
Gotcha. Is there a strong reason to assume that we’ll succeed at creating AIs that can be pointed at a single target? I read this post and comment a while back and would love your thoughts.
Thought in progress: epistemic humility is not a substitute for actual humility (or professed humility). You only get to cry wolf once, but you can probably warn about potential wolves several times—so long as you don’t burn goodwill on an incorrect or overconfident prediction.
I think epistemic humility helps to build trust and confidence in EA/Less Wrong-type spaces, but I think professed humility is far more helpful in public-facing AI comms, particularly as scenarios get more intense and specific (e.g. prefacing AI doom predictions with an amount of throat-clearing commensurate with the intensity and specificity of the forecast). For example, I think AI 2027 might have been better received if the authors had spent less time at the beginning trying to convince readers of their credibility and more time saying something along the lines of “we know this sounds crazy and are well aware of how sci-fi the scenario seems.” (I’m not a huge fan of lampshading in fiction, but IRL, I think you do need to display self-awareness of your own outlandishness in order to be taken seriously, particularly if what you’re predicting sounds insane to the average person.)
Of course, there are steeply diminishing returns on this: the more throat-clearing you do, the less confident you seem. And throat-clearing should probably be reserved for public-facing comms, because actual technical work seems to require people who are confident in their beliefs even when those beliefs are outlandish (as the outlandish-seeming explosion of recent AI progress suggests).
Still, I think the AI safety community at large has a worse reputation than it deserves, and I think part of that is due to the appearance of overconfidence. This problem seems simple, tractable, and important.
I’m a little surprised by the amount of disagree reacts, given that no one has replied.
Has anyone made a proper post about potential “warning shots” and how we should prepare for them? This post has lived rent-free in my head for the past couple of months and I’m curious to know if anyone else has been thinking about this topic too.