Does the sort of work done by the Meaning Alignment Institute encourage you in this regard? E.g. their paper (blog post) from early 2024 on figuring out human values and aligning AI to them, which I found interesting because, unlike ~all other adjacent ideas, they actually got substantive real-world results. Their approach (“moral graph elicitation”) “surfaces the wisest values of a large population, without relying on an ultimate moral theory”.
I’ll quote their intro:
We are heading to a future where powerful models, fine-tuned on individual preferences & operator intent, exacerbate societal issues like polarization and atomization. To avoid this, can we align AI to shared human values?
We argue a good alignment target for human values ought to meet several criteria (fine-grained, generalizable, scalable, robust, legitimate, auditable) and current approaches like RLHF and CAI fall short.
We introduce a new kind of alignment target (a moral graph) and a new process for eliciting a moral graph from a population (moral graph elicitation, or MGE).
We show MGE outperforms alternatives like CCAI by Anthropic on many of the criteria above.
(Their post then walks through, with figures: how moral graph elicitation works, what they mean by “values”, and how value conflicts are reconciled.)
The “substantive real-world results” I mentioned above, which I haven’t seen other attempts in this space achieve:
In our case study, we produce a clear moral graph using values from a representative, bipartisan sample of 500 Americans, on highly contentious topics, like: “How should ChatGPT respond to a Christian girl considering getting an abortion?”
Our system helped republicans and democrats agree by:
helping them get beneath their ideologies to ask what they’d do in a real situation
getting them to clarify which value is wise for which context
helping them find a 3rd balancing (and wiser) value to agree on
Our system performs better than Collective Constitutional AI on several metrics. Here is just one chart.
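To make the “moral graph” target above a bit more concrete, here is a minimal sketch of how such a graph might be represented, assuming (as their paper describes) that nodes are values articulated by participants and edges record that respondents judged one value a wiser refinement of another in a given context. The class and field names are my own illustration, not their schema.

```python
from dataclasses import dataclass, field


@dataclass
class ValueCard:
    """A value articulated by participants (fields are illustrative)."""
    card_id: str
    title: str
    # "Attentional policies": things someone attends to when living by this value.
    attentional_policies: list[str] = field(default_factory=list)


@dataclass
class WisdomEdge:
    """Records that, for a given context, participants judged one value a wiser
    refinement of another, along with how many endorsed that judgement."""
    from_card: str   # the value judged less comprehensive here
    to_card: str     # the value judged wiser in this context
    context: str     # e.g. the contentious ChatGPT scenario being discussed
    endorsements: int = 0


@dataclass
class MoralGraph:
    cards: dict[str, ValueCard] = field(default_factory=dict)
    edges: list[WisdomEdge] = field(default_factory=list)

    def wisest_for(self, context: str) -> list[ValueCard]:
        """Values relevant to a context that no endorsed edge points away from."""
        relevant = [e for e in self.edges if e.context == context]
        superseded = {e.from_card for e in relevant if e.endorsements > 0}
        on_graph = {e.from_card for e in relevant} | {e.to_card for e in relevant}
        return [self.cards[c] for c in on_graph - superseded]
```

The alignment target would then be the contextually wisest values surfaced this way, rather than raw individual preferences.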
All that was early last year. More recently they’ve fleshed this out into a research program they call “Full-Stack Alignment” (blog post, position paper, website). Quoting them again:
Our society runs on a “stack” of interconnected systems—from our individual lives up through the companies we work for and the institutions that govern us. Right now, this stack is broken. It loses what’s most important to us.
Look at the left side of the chart. At the bottom, we as individuals have rich goals, values, and a desire for things like meaningful relationships and community belonging. But as that desire travels up the stack, it gets distorted. … At each level, crucial information is lost. The richness of human value is compressed into a thin, optimizable metric. …
This problem exists because our current tools for designing AI and institutions are too primitive. They either reduce our values to simple preferences (like clicks) or rely on vague text commands (“be helpful”) that are open to misinterpretation and manipulation.
In the paper, we set out a new paradigm: Thick Models of Value (TMV).
Think of two people you know who are fighting, or two countries like Israel and Palestine, or Russia and Ukraine. You can think of each such fight as a search for a deal that would satisfy both sides, but currently this search often fails. We can see why: the searches we do in this space are usually very narrow. Will one side pay the other some money, or give up some property?
Instead of being value-neutral, TMV takes a principled stand on the structure of human values, much like grammar provides structure for language or a type system provides structure for code. It provides a richer, more stable way to represent what we care about, allowing systems to distinguish an enduring value like “honesty” from a fleeting preference, an addiction, or a political slogan.
This brings us to the right side of the chart. In a TMV-based social stack, value information is preserved.
Our desire for connection is understood by the recommender system through user-stated values and the consistency between our goals and actions.
Companies see hybrid metrics that combine engagement with genuine user satisfaction and well-being.
Oversight bodies can see reported harms and value preservation metrics, giving them a true signal of a system’s social impact.
By preserving this information, we can build systems that serve our deeper intentions.
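A purely illustrative reading of the quoted passage, not anything from their paper: the contrast is between the thin signals today’s stack optimizes and a value with enough structure to be distinguished from a fleeting preference, plus the kind of “hybrid metric” a company might track. All names and weights below are made up.

```python
from dataclasses import dataclass


@dataclass
class ThinPreference:
    """What today's stack mostly sees: a bare signal like a click."""
    item_id: str
    clicked: bool


@dataclass
class ThickValue:
    """A value with enough structure to distinguish it from a fleeting
    preference or a slogan (fields are my own illustration)."""
    name: str                      # e.g. "honesty"
    attends_to: list[str]          # what someone living by it pays attention to
    stable_across_contexts: bool   # enduring value vs. momentary impulse
    stated_by_user: bool           # explicitly articulated, not inferred from clicks


def hybrid_metric(engagement: float, value_satisfaction: float,
                  weight: float = 0.5) -> float:
    """Toy hybrid metric combining engagement with how well a user's stated
    values were served, in the spirit of the quoted passage. The weighting
    is arbitrary and illustrative."""
    return weight * engagement + (1 - weight) * value_satisfaction
```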
(I realise I sound like a shill for their work, so I’ll clarify that I have nothing to do with them. I’m writing this comment partly to surface substantive critiques of what they’re doing, which I’ve been searching for in vain: their approach seems more promising to me than anyone else’s, but I’m not competent to truly judge it.)
Thank you very much for sharing this. I will need to read up on it.
//
This is all very similar to the idea I am most interested in and have done some work on: shared, trackable ontologies. It’s too ambitious for a LW comment, but here is a rundown.
The first version is set up with broad consensus and voting mechanics. Alignment then takes place based on the ontology.
At the end of an alignment cycle, the ontology is checked and updated.
Everything is tracked with ledger tech.
The ontology can be shared and used by various labs. Versioning is tracked, and models’ ontologies are trackable.
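A rough sketch of what one cycle might look like, under my own assumptions: “ledger tech” here is just an append-only hash chain, the vote threshold is a simple majority, and all names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class OntologyVersion:
    """One consensus snapshot of the shared ontology (illustrative schema)."""
    version: int
    concepts: dict[str, str]      # concept name -> agreed definition
    approved_by_vote: bool = False


@dataclass
class Ledger:
    """Minimal append-only 'ledger': each entry hashes the previous one,
    so the history of ontology versions is tamper-evident and trackable."""
    entries: list[dict] = field(default_factory=list)

    def append(self, version: OntologyVersion) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        payload = json.dumps(
            {"version": version.version, "concepts": version.concepts, "prev": prev_hash},
            sort_keys=True,
        )
        self.entries.append(
            {"payload": payload, "hash": hashlib.sha256(payload.encode()).hexdigest()}
        )


def alignment_cycle(current: OntologyVersion, proposed_edits: dict[str, str],
                    votes_for: int, votes_total: int, ledger: Ledger) -> OntologyVersion:
    """Check and update the ontology at the end of a cycle: edits that clear a
    simple majority vote become the next tracked version on the ledger."""
    if votes_total and votes_for / votes_total > 0.5:
        nxt = OntologyVersion(current.version + 1,
                              {**current.concepts, **proposed_edits},
                              approved_by_vote=True)
        ledger.append(nxt)
        return nxt
    return current
```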
My ideas are for closed models, with the ontology owned by experts; their approach is overall more open-ended and organic.