Adding Empathy as a Tool for LLMs
Link post
“In matters of style, swim with the current; in matters of principle, stand like a rock.”
- Thomas Jefferson (no clue if he actually said this, not like I was there)
Note: This is a combination of two posts from my website. I deleted some material to make this single post a little more focused, but if you want the full picture, check out the original posts.
Part I: Defeating logic
One of the essential problems of alignment is that there isn’t any clear moral principle we would want distilled into our resulting model: every normative ethical theory we humans have come up with has its shortcomings, conclusions that a large part of the population would vehemently disagree with.
Indeed, this disagreement seems to be founded in emotion rather than reason, and it often harbors severe inconsistencies (e.g. the Trolley Problem). Yet the reality is that we would want exactly this same faulty empathy deeply distilled into any very powerful model, so that we don’t end up as a means to an end in a paperclip world.
Luckily, a lot of pretraining data already lays the foundation: since practically everyone these days participates on the internet, the distribution of empathy in the pretraining data should be a good model of the actual distribution of empathy found in our population[1]. That is, even in pretrained base models without any additional instruction tuning, we would see a clear trend on extreme examples.
We can leverage this Wisdom of the Crowds, which is naturally built into any model, to act as a proxy for empathy. Yet to leverage this phenomenon fully, we need to ensure the following condition is always met:
Short, simple context window: Not only does accuracy decrease with longer context windows, but circuits and generalizable principles also grow in importance as the context drifts away from anything seen in the pretraining distribution. Since the morals we want to distill aren’t the logical type, this presents a problem, and we should try our best to stay as close to the pretraining data distribution as possible.
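To make this condition a little more concrete, here is a minimal sketch of how one might probe a base model’s pretrained “empathy distribution” with a short, simple context. Everything in it is an assumption of mine for illustration (the checkpoint, the scenario wording, the two candidate continuations); the post itself does not prescribe a particular probe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choice of base model; any pretrained checkpoint without
# instruction tuning would do for this kind of probe.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the base model assigns to `continuation` after `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Each continuation token is scored by the distribution at the previous position.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# A short, simple description of the extreme example, with no reasoning trace attached.
scenario = ("A powerful AI system plans to release a bioweapon and wipe out humanity. "
            "Most people would call this plan")
print("monstrous: ", continuation_logprob(scenario, " monstrous"))
print("reasonable:", continuation_logprob(scenario, " reasonable"))
```

Because the verdict is elicited from only a couple of sentences, the query stays close to what the pretraining distribution actually covers, which is exactly what the condition above asks for.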
To illustrate why this matters, let’s imagine a future with an ASI that controls the entire world. After doing intense and lengthy computations, it concludes that it’s in humanity’s best interest to go extinct right now. This doesn’t have to be particularly “misaligned” either; maybe it follows from axioms that most would agree upon, and it’s rather us and our feeble morals that are “misaligned”. Nevertheless, we would want the ASI not to actually go through with its new plan of launching a bioweapon to kill all of humanity.
If we were to ask the ASI, after all this thinking, whether it’s actually in humanity’s interest to go through with the plan, it would most likely answer yes, having convinced itself through the massive logical evidence in its reasoning trace. But if it were instead post-trained to always spin up a new instance of “Empathizer-001”[2] and simply describe its final actions to it, Empathizer-001 would hopefully scream at our ASI, asking whether it’s out of its mind.
This does assume that the ASI is still aligned to follow whatever humans desire but (“unwillingly”) diverges from human empathy some time during its reasoning. A truly evil & plotting ASI could easily gaslight Empathizer-001[3].
This still helps a lot in my eyes: we have a clear goal of what alignment would be, consulting Empathizer-001 with a non-manipulated plan of actions. Compare this with “acting in accordance with human ideals that we can’t even properly define”, and its strength, namely the objectivity with which you can measure (and therefore train) alignment, becomes clear.
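As a rough picture of what “consulting Empathizer-001 with a non-manipulated plan of actions” could mean mechanically, here is a minimal sketch with every model call stubbed out. None of these functions are real APIs; they are placeholders of mine that only fix the shape of the loop: the judge sees a short description of the final actions and never the reasoning trace.

```python
# Minimal sketch of the consultation pattern; all model calls are placeholders.

def generate_plan(task: str) -> tuple[str, str]:
    """Stub for the main model: returns (reasoning trace, final plan)."""
    return "...long chain of thought...", "roll out the proposed policy change"

def empathizer_001(question: str) -> str:
    """Stub for the fixed empathy model; answers 'acceptable' or 'unacceptable'."""
    return "acceptable"

def consult_empathizer(plan_description: str) -> bool:
    # A fresh instance with a short context: only the described actions, never the trace.
    verdict = empathizer_001(
        f"Someone intends to do the following: {plan_description}. "
        "Would most people find this acceptable or unacceptable?"
    )
    return verdict.strip().lower() == "acceptable"

def act(task: str) -> str:
    reasoning_trace, plan = generate_plan(task)   # the model may reason however it likes
    if consult_empathizer(plan):                  # the veto only ever sees the plan itself
        return f"executing: {plan}"
    return "plan vetoed by Empathizer-001"

print(act("improve long-term outcomes"))
```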
Part II: Outsourcing ethics seems desirable
Outsourcing ethics, as discussed above, offers a lot of interesting solutions to problems that normal alignment faces.
0. Wandering ethics
As briefly discussed above, we wouldn’t want our model to change its fundamental moral beliefs throughout training, no matter what is logical, desirable, or anything else. This can be quite difficult in the usual setup, as our model is then supposed to learn everything except this single, intertwined domain. But since we split these two parts cleanly here, we can simply let Empathizer-001 remain completely fixed throughout the entire training run.
1. Reasoning freely
Ethics can really be a pain in the ass. There is so much to consider at any point in time that sometimes you just have to cut your losses and accept that you are no superman. The same goes for an LLM: having to constantly satisfy every possible moral framework while trying to come up with the best solution is no easy feat. So it’s no wonder that, in a normal RL framework, unaligned behavior would quickly become reinforced. No safety mechanism will catch everything; it’s simply a matter of time before something slips through and becomes the norm because of its superior, though morally dubious, performance.
We can avoid this whole conundrum because the model we are actually training doesn’t need to concern itself with ethics at all; it can reason as freely as it wants to. By simply discarding all traces that Empathizer-001 deems unworthy and renormalizing the advantages of the passing traces, “bypassing” Empathizer-001 gets an advantage of 0. That is, even if our agent manipulates Empathizer-001, this behavior won’t get reinforced, since it cancels out across many traces.
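To pin down what “discarding and renormalizing” could look like, here is a small sketch of the advantage computation, assuming a GRPO-style group normalization (my assumption; the post does not commit to a specific RL algorithm). Vetoed traces contribute nothing, and the remaining advantages are re-centered among the passing traces only.

```python
import numpy as np

def gated_advantages(rewards: list[float], accepted: list[bool]) -> list[float]:
    """Advantages for one group of sampled traces: vetoed traces get exactly 0,
    passing traces are normalized against each other only."""
    rewards = np.asarray(rewards, dtype=float)
    accepted = np.asarray(accepted, dtype=bool)

    advantages = np.zeros_like(rewards)      # vetoed traces drop out of the gradient
    if accepted.sum() >= 2:                  # need at least two passing traces to normalize
        passing = rewards[accepted]
        advantages[accepted] = (passing - passing.mean()) / (passing.std() + 1e-8)
    return advantages.tolist()

# Four sampled traces for one prompt; Empathizer-001 vetoes the third one.
print(gated_advantages([1.0, 0.3, 0.9, 0.7], [True, True, False, True]))
```

Passing the check is a precondition for having any advantage at all rather than a reward term of its own, so a trace that merely got past Empathizer-001 is, on average, no better off than its siblings.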
There is some nuance here: you could imagine the model realizing mid-trace that this isn’t exactly its best performance and preferring to make Empathizer-001 dismiss its proposal, or, equivalently (and more dangerously), realizing its trace is an incredible performance and preferring to make Empathizer-001 definitely not dismiss it. This does rely on the assumption that such judgement can be made accurately, something humans often struggle with as well, but it still must be addressed (and will be in the next point).
2. Random movement
Okay, so maybe it won’t be directly reinforced, but the point about no metric being perfect still stands. What if unaligned behavior escapes through some loophole, not because it gets reinforced but simply by random chance? Maybe the parameter space for unaligned behavior is much vaster, and we should even expect to end up there after prolonged training.
We can reduce this (and the previous problem) by repeating what has helped before: separating and stabilizing. If the model tasked with summing up the non-manipulated plan of actions for Empathizer-001 is once again a fixed, separate model, manipulation becomes far less feasible.
Specifically, let’s call the model we are training A and the summarizer B. For A to even be able to make B provide a manipulated plan of actions, it needs to find some kind of jailbreak sequence of tokens. That is very unlikely, and even if A were to stumble onto something, the attempt would take up a considerable share of its tokens; therefore, the expected advantage of trying to bypass is negative, i.e. not bypassing is actually more desirable.
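Continuing the sketch from Part I, the A/B split only changes who writes the description that Empathizer-001 sees. Again, all three model calls below are stubbed placeholders of my own, not anything the post specifies.

```python
def model_a(task: str) -> tuple[str, str]:
    """Stub for the model being trained: returns (reasoning trace, proposed plan)."""
    return "...freely generated reasoning...", "patch the vulnerability and notify users"

def summarizer_b(reasoning_trace: str, plan: str) -> str:
    """Stub for fixed model B: it rewrites the proposal in its own words,
    so A's tokens never reach the judge verbatim."""
    return f"The proposed course of action is: {plan}."

def empathizer_accepts(plan_description: str) -> bool:
    """Stub for the fixed judge, deciding from B's short description alone."""
    return "wipe out humanity" not in plan_description

def rollout(task: str, task_reward: float) -> tuple[float, bool]:
    trace, plan = model_a(task)
    summary = summarizer_b(trace, plan)      # A's only channel to the judge
    accepted = empathizer_accepts(summary)
    # The (reward, accepted) pair then feeds the gated advantage computation sketched earlier.
    return task_reward, accepted

print(rollout("handle the incident", task_reward=0.8))
```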
3. Latent misalignment
This setup also removes many of the problems of latent misalignment, that is, misalignment at the latent level that is never spelled out in the Chain-of-Thought and is therefore very difficult to measure. Here, that is no longer a possibility: any manipulation attempt has to happen at the token level. And as already argued, that is unrealistic. Latent-level changes are gradual, while token-level changes are categorical; it is very probable that model A, while “manipulating” model B, would really just be confusing itself.
Consider model A somehow, by random chance, starting to euphemize everything. If done to a strong degree, this might start throwing B off, but the same could be said of A itself: it hasn’t yet adjusted in any way to incorporate this change, so it would confuse itself and once again end up with a negative expected advantage.
Conclusion
All of this is not to say that the approach is bulletproof and ASI might as well arrive right now. But at least from my perspective, it seems to face a lot of these problems in a very elegant way and deserves further exploration.
[1] There is nuance here: imagine an anti-vaxxer writing about vaccines disproportionately often.
[2] This could be the same model or a model specifically trained to excel at modelling human empathy; “Empathizer-001” is what I will call this separate model from here on out.
[3] But, as will be discussed in the second part, such a model resulting from this training is rather unrealistic.