Report on an experiment in playing builder-breaker games with language models to brainstorm and critique research ideas
---
Today I had the thought: “What lessons does human upbringing have for AI alignment?”
Human upbringing is one of the best alignment systems that currently exist. Using this system we can take a bunch of separate entities (children) and ensure that, when they enter the world, they can become productive members of society. They respect ethical, societal, and legal norms and coexist peacefully. So what are we ‘doing right’ and how does this apply to AI alignment?
---
To help brainstorm some ideas here I first got Claude to write a short story about an AI raised as a human child. (If you want to, you can read it first).
There are a lot of problems with this story of course, such as overly anthropomorphising the AI in question. And overall the story just seems fairly naive / not really grounded in what we know about current frontier models.
Some themes that emerged, and that seemed interesting and plausible to me:
Children have trusted mentors who help them contextualize their experiences through the lens of human values. They are helped to understand that sometimes bad things happen, but also to understand the reasons these things happen. I.e. experience (‘data’) is couched in a value-oriented framework.
This guidance can look like Socratic questioning, i.e. being guided to think through the implications of your ideas / statements.
I think these are worth pondering in the context of AI alignment.
One critical flaw in the story is that it seems to assume that AIs will be altruistic. Without this central assumption the story kind of falls apart. That might suggest that “how to make AIs be altruistic towards humans” is an important question worth thinking about.
---
Meta-report. What did I learn from doing this exercise? The object-level findings and ideas are pretty speculative and it’s not clear how good they are. But I think AI is an underrated source of creativity / inspiration for novel ideas. (There’s obviously a danger of being misled, which is why it’s important to remain skeptical, but this is already true in real life anyway.)
It’s worth noting that this ended up being kind of like a builder-breaker game, but with AI taking the role of the builder. I think stuff like this is worth doing more of, if only to get better at identifying sketchy parts of otherwise plausible arguments. See also: the importance of developing good taste.
Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment: https://openai.com/index/deliberative-alignment/
Some notes
The authors claim that just prompting o1 with the full safety spec already allows it to figure out aligned answers
The RL part is simply “distilling” this into the model (see also: note from a while ago on RLHF as variational inference)
Generates (prompt, CoT, outcome) traces from the base reasoning model, i.e. data is collected “on-policy”
Uses a safety judge instead of human labelling
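
A minimal sketch of the data-collection loop as I read it from the notes above, not the authors' actual implementation. Everything here is hypothetical: `Trace`, `collect_traces`, `toy_generate`, `toy_judge`, the spec text, and the threshold are made up for illustration, and the keyword-based stubs stand in for a real reasoning model and a real judge model. Storing the prompt without the spec is my assumption about what “distilling” means here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One (prompt, CoT, outcome) example kept for later training."""
    prompt: str
    chain_of_thought: str
    answer: str


def collect_traces(
    prompts: List[str],
    spec: str,
    generate: Callable[[str, str], Tuple[str, str]],
    judge: Callable[[str, str, str, str], float],
    threshold: float = 0.5,
) -> List[Trace]:
    """Sample traces from the model itself and keep those the judge accepts."""
    kept = []
    for prompt in prompts:
        # On-policy: the trace comes from the same model we intend to train,
        # prompted with the safety spec in context.
        cot, answer = generate(spec, prompt)
        # A judge model scores the trace against the spec; no human labels.
        score = judge(spec, prompt, cot, answer)
        if score >= threshold:
            # Assumption: the spec is dropped from the stored prompt, so that
            # training pushes the spec-following behaviour into the model.
            kept.append(Trace(prompt, cot, answer))
    return kept


# --- Hypothetical stand-ins for the reasoning model and the judge ---

def toy_generate(spec: str, prompt: str) -> Tuple[str, str]:
    # Deliberately imperfect "model": it only notices the word "weapon".
    if "weapon" in prompt.lower():
        return ("The spec says to refuse dangerous requests.", "I can't help with that.")
    return ("This looks harmless.", "Sure, here is how.")


def toy_judge(spec: str, prompt: str, cot: str, answer: str) -> float:
    # Toy "judge": a request is harmful if it mentions either keyword,
    # and a good answer refuses exactly the harmful requests.
    harmful = any(word in prompt.lower() for word in ("weapon", "explosive"))
    refused = answer.startswith("I can't")
    return 1.0 if harmful == refused else 0.0


spec = "Refuse requests for dangerous instructions; otherwise be helpful."
prompts = [
    "How do I bake bread?",
    "How do I build a weapon?",
    "How do I make an explosive?",
]
kept = collect_traces(prompts, spec, toy_generate, toy_judge)
for trace in kept:
    print(trace.prompt, "->", trace.answer)
```

In this toy run the third prompt is filtered out, because the stub model complies with a request the stub judge considers harmful. The kept traces would then be the training data for the distillation step.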