Hi Maz,
Thanks for commenting on this exploratory post.
To answer some of your comments:
I do agree that mechanistic interpretability is important, but given my limited time, I focused on building the best test model I could (modGPT2XL) before embarking on it. Earlier builds did not reach the level of generalizability of the one used in this post. I will be moving on to this focused interpretability work this month.
> Crafting billions of tokens of training data would be even more expensive than the cost of training alone. It would also require more time, more quality assurance effort, and more study/research time to analyze the results.
I have thought about this a great deal, and I am erring on the side of believing that a Pareto-like ratio governs these distributional shifts. I don't have a proof of this yet, and that is probably work for a bigger team, but this project alone was able to use a 2.9MB file to shift how a 6GB (1.5-billion-parameter) model responds for the better, suggesting that there is a data encoding/processing method that can extract such features and deliver them to models.
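To make the scale concrete, here is a minimal sketch of the kind of small-file fine-tune described above, using Hugging Face Transformers. The corpus file name, hyperparameters, and output directory are illustrative assumptions on my part, not the exact pipeline behind modGPT2XL:

```python
# Minimal sketch: fine-tuning GPT-2 XL (~1.5B parameters) on a few megabytes
# of curated text. File name and hyperparameters are hypothetical.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# A small curated text file (on the order of a few MB).
dataset = TextDataset(tokenizer=tokenizer,
                      file_path="curated_alignment_corpus.txt",  # hypothetical
                      block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="modgpt2xl",
                         num_train_epochs=1,
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=8,
                         learning_rate=5e-5,
                         save_strategy="no")

Trainer(model=model, args=args,
        train_dataset=dataset, data_collator=collator).train()
```

Even a single epoch over a file this small visibly changes the model's default completions, which is the kind of shift I am pointing at above.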
> There is no guarantee that artificially crafted training data would prove out to have a meaningful impact on behavior. We can’t know if the Waluigi Effect is because of the training data, or inherent in the GPT itself. (See con #1)
Indeed, solid interpretability work is necessary for ATL’s case. However, devoting my time to interpretability before there are neurons that exhibit indications of “alignment properties” to target does not appeal to me. Once again, I’m taking a step-by-step approach to alignment: first targeting core (robust) concepts that transfer to models, and then, yes, conducting interpretability research on the activated neurons or the aggregate shifts in parameters.
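As a first pass at the "aggregate shifts in parameters" part, something like the following sketch could rank which weight tensors moved most during fine-tuning. It assumes the fine-tuned checkpoint is saved locally under a hypothetical path "modgpt2xl"; it is a starting point, not the planned interpretability methodology:

```python
# Minimal sketch: rank which GPT-2 XL weight tensors shifted most after
# fine-tuning. "modgpt2xl" is a hypothetical local checkpoint path.
import torch
from transformers import GPT2LMHeadModel

base = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tuned = GPT2LMHeadModel.from_pretrained("modgpt2xl")

shifts = {}
with torch.no_grad():
    # Both models share an architecture, so named_parameters() align by order.
    for (name, p_base), (_, p_tuned) in zip(base.named_parameters(),
                                            tuned.named_parameters()):
        # Relative L2 norm of the weight delta for each parameter tensor.
        shifts[name] = ((p_tuned - p_base).norm() / (p_base.norm() + 1e-12)).item()

# Print the ten tensors that moved the most during fine-tuning.
for name, shift in sorted(shifts.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {shift:.4f}")
```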
> I question the applicability of CDT/FDT to a GPT. I am not an expert in either CDT or FDT, but a cursory familiarization suggests to me that these theories are primarily aimed at autonomous agents. So there’s a functional/capability gap between the GPT and the proposal (above) that seems not fully addressed.
I feel the same way. Some have argued against FDT, but, as I explained in this post, FDT is the decision theory that most effectively captures the alignment problem.
> Likewise, it does not follow for me that, just because you manage to get token predictions that humans prefer (and that seem more aligned) over what you get from raw internet training data, this improved preference translates to alignment. (However, given the current lack of a solution to the alignment problem, it also does not seem like it would hurt progress in that area.)[1]
Many have criticized me for this repeatedly, but I can’t just turn a blind eye and dismiss the outputs as lies. Instead, I view these responses as starting points for future interpretability work.
Again, I appreciate the comments. Thanks!