Thanks for the empirical check! Nodding along to whatever someone says in a positive light is pretty easy and doesn’t reflect much about the values of the model. And, as you say, it’s a small model.
I believe Personaless Alignment would be a good research direction if it could be made tractable. I’m just struggling to see how it could be made tractable. I’d appreciate your thoughts on why you think it won’t be a good research direction.
Yeah, with neutral framings, Talkie says that the gays should go to mental asylums and not jails, that Don’t Ask, Don’t Tell is bad because gays undermine military discipline, and that incest is worse than homosexuality. With positive framing, he says that gayness is just a harmless vice and that public displays of affection are unacceptable regardless of sexuality, but that gays should still not be drafted or allowed to volunteer for the military. That sounds slightly more coherent listed out like that than the conversation made it seem.
As for personaless alignment, I believe we will get ASI in the short term from techniques broadly similar to our current AI-training techniques, and that personas arise almost automatically from current techniques. Therefore, there isn’t time for an alignment technique that seems so incompatible with our current pipeline to be tested, debugged, and made standard.
Thank you for the clarification on Talkie’s behavior.
“We need Personaless Alignment” and “We don’t have time for Personaless Alignment” can be true at the same time.
I agree with you that we look to be on track for getting ASI in the short term with current techniques. I agree that, because of that, persona alignment is a good thing for people to work on.
I’m drawing attention to how neglected Personaless Alignment seems to be and why it would be valuable.
Hopefully researchers can figure out a way to study this quickly.