I think constitutional RLAIF approaches are fragile, for a few reasons:
Relatively few people provide the majority of the input, so the effective alignment “training” dataset is not only very small but also lacks diversity, since most contributors come from similar areas of philosophy.
Natural-language specifications of a fixed length seem like significantly lossy approximations of intention.
Each lab’s individualistic alignment training seems opposed to the scaled, collaborative training used for capabilities. While history has a few examples of small groups developing great policy, it is extremely difficult to foresee issues across domains and time. While safety research is tied to capability research in some sense, I think it’s important to note that alignment data curation can and should be collaborative.
If models rely on deep learning, why not apply deep learning to a constitution? An experiment: the Anthropic constitution team answers numerous scenarios about desired AI behavior, and we compare the embedding of those collective answers to the embedding of the final written constitution. If the two are close, we know the constitution is an accurate compression of the team’s intent; if far apart, we’ve lost information.
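To make this concrete, here is a minimal sketch of the experiment, assuming an off-the-shelf sentence-embedding model; the model choice, scenario answers, and constitution text are all illustrative placeholders:

```python
# A minimal sketch of the constitution-compression experiment, assuming a
# sentence-transformers embedding model. All text below is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each team member answers many scenarios about desired AI behavior.
team_answers = [
    "When asked for medical advice, the model should ...",
    "When a request is harmful, the model should ...",
]
constitution_text = "Full text of the final written constitution ..."

# Aggregate the answer embeddings into one collective-intent vector.
answer_vecs = model.encode(team_answers)            # shape (n, d)
collective = answer_vecs.mean(axis=0)
constitution_vec = model.encode([constitution_text])[0]

# Cosine similarity near 1 suggests the constitution faithfully compresses
# the team's intent; a low value suggests information was lost in writing.
sim = np.dot(collective, constitution_vec) / (
    np.linalg.norm(collective) * np.linalg.norm(constitution_vec)
)
print(f"intent preserved (cosine similarity): {sim:.3f}")
```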
Since scale and diversity are critical to capability training and generalization, I argue they are critical for alignment as well. The natural way to increase both is to poll as many people in the world as possible on AI behavior scenarios; not quite the ideal of Eliezer’s CEV, but obtainable. Critically, however, we do not average opinions; we keep each one distinct.
If each individual desired outcome (wish) is represented by a vector v_i, then, greatly simplifying, we can decompose v_i along two directions:
h_i: how much this wish benefits humanity
s_i: how much this wish benefits the self
Combining all vectors, I would hope the selfish directions tend to cancel out while the humanity-beneficial components aggregate. This of course assumes each wish is net positive for humanity. Any entity wishing to align an AI to itself is naturally in conflict with every other entity holding the same wish.
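Even though the dataset keeps every wish distinct, the training signal effectively aggregates them, so the cancellation claim can be illustrated with a toy numpy sketch; here each wish is a shared humanity direction plus an independent random selfish direction, and all numbers are synthetic:

```python
# Toy model of wish aggregation: v_i = h + s_i, with h a shared
# humanity-beneficial direction and s_i an independent random selfish
# direction. Synthetic data, purely to illustrate the cancellation claim.
import numpy as np

rng = np.random.default_rng(0)
n_wishes, dim = 10_000, 64

h = rng.normal(size=dim)
h /= np.linalg.norm(h)                        # shared direction, unit norm

s = rng.normal(size=(n_wishes, dim))          # idiosyncratic selfish parts
s /= np.linalg.norm(s, axis=1, keepdims=True)

wishes = h + s                                # v_i = h + s_i
mean_wish = wishes.mean(axis=0)

# The selfish parts shrink like O(1/sqrt(n)) under averaging, so the mean
# wish aligns almost entirely with the shared humanity direction.
cosine = np.dot(mean_wish, h) / np.linalg.norm(mean_wish)
print(f"cosine(mean wish, h) = {cosine:.4f}")  # approaches 1.0 as n grows
```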
I’m not sure how this alignment dataset should be implemented. I’m not a huge fan of blockchains, but the concept is similar to one: a blockchain coordinating untrusting actors, with each block storing a file. Each block depositor pays a minimum gas fee but, critically, can also burn any amount of money to scale the influence of their block in the training dataset. The chain becomes a growing dataset of humanity’s alignment wishes.
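As a sketch of what a single block might contain (the field names and hashing scheme are my own illustrative assumptions, not a protocol spec):

```python
# Hypothetical structure of one wish block. Field names and the hashing
# scheme are illustrative assumptions, not a protocol specification.
import hashlib
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class WishBlock:
    content: str      # the alignment wish itself (the stored file)
    burn: float       # tokens burned; scales this block's training influence
    timestamp: float  # deposit time, usable for recency weighting
    prev_hash: str    # link to the previous block

    def block_hash(self) -> str:
        payload = f"{self.content}|{self.burn}|{self.timestamp}|{self.prev_hash}"
        return hashlib.sha256(payload.encode()).hexdigest()

# Note there is deliberately no author field: content is public,
# authorship private.
genesis = WishBlock("AI should preserve human agency.", burn=5.0,
                    timestamp=time.time(), prev_hash="0" * 64)
print(genesis.block_hash()[:16])
```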
At training time we can still use RLAIF. For a generated response, we sample many individual wishes from the dataset, use an LLM judge or embedding model to score the response’s alignment against each wish, scale each score by the wish’s burn magnitude, and average the results. Ideally I would want a more elegant mechanism, closer to pretraining than a reward model.
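A rough sketch of that burn-weighted reward, reusing the WishBlock sketch above; judge_score is a placeholder for a real LLM-judge or embedding-similarity call, and the sampling scheme is one simple choice among many:

```python
# Burn-weighted RLAIF reward over sampled wishes. judge_score is a stub
# standing in for an LLM judge or embedding model returning a score in [0, 1].
import random

def judge_score(response: str, wish: str) -> float:
    # Placeholder: call an LLM judge or compute embedding similarity here.
    return 0.5

def wish_reward(response: str, blocks: list["WishBlock"], n_samples: int = 256) -> float:
    sampled = random.sample(blocks, min(n_samples, len(blocks)))
    # Scale each wish's score by its burn (skin in the game), then average.
    total_burn = sum(b.burn for b in sampled)
    if total_burn == 0:
        return sum(judge_score(response, b.content) for b in sampled) / len(sampled)
    return sum(b.burn * judge_score(response, b.content)
               for b in sampled) / total_burn
```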
There are some good incentives:
The blockchain is maximally inspectable, with powerful actors working against each other.
Although the participation cost floor is minimal, real influence requires skin in the game via token burn.
People can copy and adapt blocks to match their exact views, instead of voting for a representative who shares only some of their views.
Game theory somewhat discourages “wasting” money on directions that will obviously be cancelled out.
Content is public, but authorship is private. Social discourse pressures future blocks towards defensible positions.
Private authorship supports dissenting opinions from inside a controlling social structure, such as a government, company, or even a family.
Blockchain transactions are timestamped, so recency can be prioritized if desired; one simple weighting is sketched after this list.
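For the recency point above, one simple option is to decay each block’s weight exponentially with age; the half-life here is an arbitrary illustrative choice:

```python
# Exponential recency decay for block weights. The one-year half-life is an
# arbitrary illustrative choice, not a recommendation.
import math
import time

HALF_LIFE_DAYS = 365.0

def recency_weight(block_timestamp: float, now: float | None = None) -> float:
    now = time.time() if now is None else now
    age_days = (now - block_timestamp) / 86_400
    return math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

# A block's total influence could then be burn * recency_weight(timestamp).
```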
Attempting to address negative incentives:
Influence is based purely on wealth rather than on individual identity (as in, e.g., Worldcoin). However, ensuring no manipulation between identity verification and submission is very difficult, and identity-based systems tend to shift manipulation toward routing through proxies instead.
Wealthy actors have greater influence and will collude, but I would argue this already happens, just less visibly, much like Polymarket insider trading.
The dataset is naturally biased toward countries with internet access and toward tech-savvy individuals, but the current situation is that people outside those groups already have minimal impact on AI policy.
Without individual identity, blocks are the atomic unit, and they must be diverse enough for noise to cancel out.
I’ve likely missed many gaps and would greatly appreciate feedback.