Thanks! Will reply to some different bits separately. First, on reddit-karma training:
I imagine the easiest thing would be to prepend the karma to each post, fine-tune the model, and then generate high-karma posts by just prompting with “Karma 1000: …”.
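Concretely, I'm picturing something like this (the field names and the “Karma N:” prefix format are just placeholder choices, and the actual fine-tuning step is elided):

```python
# A minimal sketch of the "prepend the karma" idea.
posts = [
    {"karma": 3, "text": "my cat did a thing"},
    {"karma": 1042, "text": "TIL something genuinely surprising"},
]

def to_training_example(post):
    # e.g. "Karma 1042: TIL something genuinely surprising"
    return f"Karma {post['karma']}: {post['text']}"

corpus = [to_training_example(p) for p in posts]
# ...fine-tune the LM on `corpus` with ordinary next-token prediction,
# then sample high-karma-style text by prompting with "Karma 1000: ".
```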
This doesn’t accomplish what I’m going for (probably). The key thing I want is to directly reward GPT disproportionately in different circumstances. As I currently understand it, every situation for GPT is identical – a bunch of previous words, one more word to predict, and a grade on that one word.
GPT never accidentally touches a burning hot stove, or gets a delicious meal, or builds up a complicated web of social rewards that it aspires to succeed at. I bet toddlers learn not to touch hot stoves very quickly, even without parental supervision – faster than GPT could.
I don’t want “1 karma”, “10 karma” and “100 karma” to be a few different words with different associations. I want 10 karma to be 10x the reward of 1 karma, and 100 karma to be 10x that. (Well, maybe not literally 10x; I’d fine-tune the reward structure with some fancy math.)
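For instance, one candidate for that “fancy math” is log-scaling, so the reward keeps growing with karma but each extra decade counts for roughly the same amount (purely illustrative):

```python
import math

def karma_to_reward(karma):
    # Illustrative only: reward grows monotonically with karma,
    # but sub-linearly, so 100 karma isn't literally 100x of 1 karma.
    return math.log10(1 + karma)

print(karma_to_reward(1), karma_to_reward(10), karma_to_reward(100))
# ~0.30, ~1.04, ~2.00: each decade adds roughly a constant increment
```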
When GPT-3 sort of struggles to figure out “I’m supposed to be doing addition or multiplication here”, I want to be able to directly punish or reward it more strongly than usual.
Well, sure, you could take bigger gradient-descent steps for some errors than others. I’m not aware of people doing that, but again, I haven’t checked. I don’t know how well that would work (if at all).
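Mechanically, I’d guess it amounts to scaling each example’s loss before backprop, something like this PyTorch sketch (the model, optimizer, and karma-derived weight are all stand-ins, not anyone’s actual setup):

```python
import torch
import torch.nn.functional as F

def weighted_step(model, optimizer, input_ids, target_ids, weight):
    # `weight` scales this example's contribution: a larger weight
    # means a bigger gradient-descent step for errors on this example.
    logits = model(input_ids)                   # (seq_len, vocab_size)
    loss = F.cross_entropy(logits, target_ids)  # ordinary LM loss
    (weight * loss).backward()                  # scaled gradients
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage: weight could be karma_to_reward(post_karma), so high-karma
# posts pull harder on the parameters than low-karma ones.
```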
The thing you’re talking about here sounds to me like “a means to an end” rather than “an end in itself”, right? If writing “Karma 100000: …” produces the high-karma-ish answer we wanted, does it matter that we didn’t use rewards to get there? I mean, if you want algorithmic differences between Transformers and brains, there are loads of them; I could go on and on! To me, the interesting question raised by this post is: to what extent can they do similar things, even if they’re doing it in very different ways? :-)