Why should we trust an agent with integrity more than one that is compliant with rules?
This seems too strong to me. At the end you say that ‘Integrity doesn’t speak to the goodness of values’, but it seems like in the rest of the post you’re not really taking that into account. Integrity does seem important to me, and I appreciate the pointer to it (and Velleman) as a useful framing for an important property. But it seems somewhat orthogonal to the question of what values are, and integrity alone says very little about whether we can trust an agent (to be clear, I do think that the Claude constitution specifies values). As a result, passages like the quoted one above seem misleading.
The constitution’s values currently exist in natural language with no formal account of what makes something count as a value, how values relate, or how they should be revised. The aforementioned breakdown of honesty is moving in the right direction. But it still lacks a type system.
The alternative is structured representations that specify the grammar by which values can be expressed, compared, and updated.
At the risk of being an over-literal programmer, even after skimming the ‘full-stack’ paper, I have no idea what this means. Is there somewhere that you give concrete examples of a type system for values, or an appropriate structured representation, or (from the paper) a grammar for values? It seems like you’re drawing on terms from computer science and programming language design (unless that’s coincidental) but I don’t understand what those terms mean in this context.
For an example of a “type system for values” (agreed, it’s an imperfect computer-science-y reference), check out the notion of attentional policies and values cards in this paper: https://arxiv.org/pdf/2404.10636
What we mean is that many kinds of things commonly get called “values”: matters of taste, social norms, ideological slogans, etc. You may have a preference for “mini skirts” or “brownies over cookies”, but most people would agree those aren’t really values in the same way “honesty” or “creativity” are; they don’t say anything substantial about how you want to live, or what you think is important in life.
But words like “honesty” are too loose: for one person honesty means never lying, for another it means “speaking from a place of authenticity at all times”, and for someone else it’s something about epistemic humility.
Hence the need for a tighter type system than just strings, if we want to structurally reason over these.
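To make the "tighter than strings" point a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Preference`, `ValuesCard`, `attentional_policies`, `is_value` and `same_label_different_value` are hypothetical, and the shape is only loosely inspired by the values cards in the linked paper, not a reproduction of its schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Preference:
    """A matter of taste, e.g. "brownies over cookies". Hypothetical type."""
    statement: str


@dataclass(frozen=True)
class ValuesCard:
    """A value spelled out as what one pays attention to when living by it.

    Hypothetical shape, loosely inspired by the paper's values cards.
    """
    title: str
    attentional_policies: FrozenSet[str]


Candidate = Union[Preference, ValuesCard]


def is_value(candidate: Candidate) -> bool:
    """Only a ValuesCard counts as a value; tastes are filtered out by type."""
    return isinstance(candidate, ValuesCard)


def same_label_different_value(a: ValuesCard, b: ValuesCard) -> bool:
    """True when two cards share a title but differ in what they attend to."""
    return a.title == b.title and a.attentional_policies != b.attentional_policies


# Three readings of "honesty", each made explicit (wording is illustrative).
never_lying = ValuesCard("honesty", frozenset({"whether each statement I make is true"}))
authenticity = ValuesCard("honesty", frozenset({"whether I am speaking from what I actually feel"}))
humility = ValuesCard("honesty", frozenset({"how uncertain I really am about my claims"}))
brownies = Preference("brownies over cookies")

print(is_value(brownies))                                     # False
print(same_label_different_value(never_lying, authenticity))  # True
print(len({never_lying, authenticity, humility}))             # 3
```

As bare strings, all three cards would collapse into the single label "honesty". With even this much structure, a program can tell them apart and can refuse to treat a taste as a value, which is roughly the kind of structural reasoning meant here.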
Interesting post, thanks.
Thanks!