Here are excerpts from a quick internal memo I wrote at Forethought about whether to rename the class of work that has variously been called “AI character” or “model spec.” Note that I ~don’t do object-level research in this field, nor do I have direct decision-making power; I just sometimes get inspired by small random drive-by projects like this (and thought the effort:impact ratio here was good).
Why might we want a new name?
People don’t like the current name.
All else being equal, that’s a good reason to change the name, especially in early stages, though we should be careful about being too willing to change our terms for other people.
“AI character” is confusing and overloaded as a phrase.
In particular, I think it connotes a strong position that the decision procedure on moral matters for models should depend on virtue ethics, which might work for Anthropic but a) is broadly controversial and b) is not necessary for arguing our case.
What criteria do we want in a new name?
Here are some criteria that I think are probably central for a term for the field (feel free to disagree!)
It governs the intended behavior of the model
ideally in a robustish way?
The name should be ecumenical on normative ethics as a decision procedure[1].
Note that the three ~most prominent existing clusters of terms each load a different normative-ethical theory: “AI character” leans virtue ethics, “alignment target” / “value loading” leans goal-directed consequentialism, and “model spec” leans (more weakly) rule-based deontology. A good replacement should avoid all three.[2]
Ideally it should try to completely punt the question of whether the AIs in question should be thought of as normative/moral agents guided fully by human ethics.
Analogy: an idealized scientist has some normative commitments but (s)he does not need the full scope of human ethics to be a good scientist
It is designed with high compressibility/communication value in mind
we’re not trying to legislate stuff like “the model prefers tabs to spaces in certain programming languages”
we want a field for describing the thing that can ideally be all of:
organizational mandate
can be used to define RL environments during training
and maybe something the model itself can refer to if it’s unsure in production (a toy sketch of one artifact playing all three roles follows this list).
The name’s more about proclivities/propensities than about capabilities.
There are some other implicit criteria that are relevant to naming in general, of course.
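To make the “all of the above” criterion above concrete, here is a minimal, purely illustrative sketch (mine, not from the memo; everything in it, including the name BehavioralSpec and the keyword-matching grader, is a hypothetical stand-in) of how one spec artifact could play all three roles: readable organizational mandate, rubric defining a reward signal for an RL environment, and reference text the deployed model consults when unsure. A real training setup would use a trained judge/grader model rather than keyword matching.

```python
# Illustrative sketch only: one spec artifact reused as (1) an organizational
# mandate, (2) a rubric that scores RL rollouts, and (3) a reference the model
# can consult in production. All names and the scoring rule are hypothetical.
from dataclasses import dataclass


@dataclass
class BehavioralSpec:
    clauses: list[str]  # intended propensities, written as short labeled clauses

    def as_mandate(self) -> str:
        # Role 1: the organizational mandate is just the readable text itself.
        return "\n".join(f"- {c}" for c in self.clauses)

    def score_rollout(self, transcript: str) -> float:
        # Role 2: a stand-in reward signal for an RL environment. A real setup
        # would ask a trained grader model whether each clause was respected;
        # this toy version just checks whether the clause's label appears.
        hits = sum(1 for c in self.clauses if c.split(":")[0].lower() in transcript.lower())
        return hits / max(len(self.clauses), 1)

    def as_reference(self, question: str) -> str:
        # Role 3: text the production model could be shown when it is unsure.
        return f"When unsure about {question!r}, consult the spec:\n{self.as_mandate()}"


spec = BehavioralSpec(clauses=[
    "Truthfulness: do not assert things you believe to be false",
    "Refusals: decline requests that give serious uplift toward harm",
])
print(spec.as_mandate())                                            # role 1: mandate
print(spec.score_rollout("The reply prioritized truthfulness."))    # role 2: reward = 0.5
print(spec.as_reference("an ambiguous dual-use request"))           # role 3: reference
```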
Some candidate names that I think are better than “AI character”: AI dispositional targets or just AI dispositions, AI doctrine (as in “military doctrine”), propensity targeting, propensity specification, AI behavioral specification, AI behavioral target, ethos, charter, model conduct.
Some candidate names other people floated that I personally liked less than the ones above: alignment target, constitution, model spec.
In addition to all the usual reasons we might want to be precise about the denotations and connotations of a name, it’s also very relevant that current-gen AIs are already almost superhuman at picking up subtle vibes and connotations of vague words. So we want to avoid subtly biasing the AIs toward unconsidered “ways to be” if we can, especially in something as central as what we call the spec/character/constitution/alignment target, etc.
Boaz makes a similar point here: https://www.lesswrong.com/posts/nBEBCtgGGKrhuGmxb/thoughts-on-claude-s-constitution
I think “AI character” is a good name actually.
I would gloss consequentialism/deontology/virtue ethics as:
Things an entity can cause that are good
Ways an entity can act that are good
Ways an entity can be that are good
These concepts still apply even to narrower notions of good (as with the scientist example), and aren’t fundamentally at odds with each other either. I think it makes sense to use each concept when it makes sense, instead of trying to avoid them altogether.
The sort of intervention that this work is aiming at is not on the actual things the AI does (you can’t actually intervene on this directly), nor is it on the ways it can act (this would be something like “guardrails” which force refusals), but it is instead trying to shape the sort of entity that the AI is. So of course virtue ethics is going to be a natural frame.
It’s complicated, but I want to separate something like the field’s question: “What ~propensities do we inculcate in an AI such that “good things happen”?” from a tentatively plausible answer like “the propensities are similar to treating the AI as if it were a virtuous human.[1]”
For example, the “virtues” in an encyclopedia are different from the virtues in a knowledgeable person. Telling the full truth is very important in an encyclopedia, whereas tact is less important.
Similarly, there are a bunch of convergent reasons to think that we ought to want our LLMs to be unusually truthseeking in their utterances relative to similarly intelligent humans.
As another, more controversial example, consider a gun vs. a soldier. “Reliability” is a very important virtue in a gun, possibly the most important. Reliability is also important in a soldier, but we don’t want soldiers to just follow orders if the orders are illegal and unconstitutional. In at least some cases we may also want soldiers to refuse legal and constitutional orders if they go sufficiently against the soldiers’ conscience. I think it’s an open question right now whether the design of robot soldiers/autonomous drones should make their behavior closer to that of a gun or to that of a human soldier in a free society. Even if you strongly believe robot soldiers should act like human soldiers when given illegal and/or immoral orders, you could hopefully still recognize that this is somewhat of an open question right now.
Either the same set of virtues as people or a subset or a superset.
“Entity” is doing some quiet heavy lifting here; this is not a principled Machine Learning term—it is a semiotic choice that makes virtue ethics feel natural.
Having just re-read the AI character post and comments, I don’t really get what the described line of research is actually trying to achieve, and other people also seem confused about it. It’s hard to pick a descriptive name for something if I only have a vague understanding of what it is I’m describing.
For example, I feel that I have some understanding of what the terms “alignment target” and “constitution” and (kinda) “model spec” mean, and they don’t mean the same thing, and I wouldn’t use them to point at the same research agenda.
“Alignment target” is the term that describes what I think is the best subject of research (out of the three). That is, I care a lot more about what goal ASI is actually pointed at, than I care about the contents of a model’s constitution.
Roughly, the line of research is something like “what propensities do we want to train the AI models to have?”, especially in the near term / the next few generations, with the hope that this either generalizes well by seeding and inertia or otherwise has a bunch of other positive downstream consequences listed in the original post.
“Alignment target” the way you define it[1] is very important but also far-mode. And we might hope that the AIs can help us a lot on the way to ASI (e.g. via CEV or some other assisted reflection process). So it’s possible we can punt this question to our future selves with AI assistance.
(Though I think it’s plausible that significantly more people should be thinking about the post-ASI alignment target right now as well; I just haven’t really thought about that question recently.)
which, to be clear, I think is also the historical definition.
I suspect that it can’t be punted; see, e.g., Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions and Problems I’ve Tried to Legibilize.