Thanks for the reply. I agree that it would be exciting in itself to create “a formal framework for heuristic arguments that is well-developed enough that we can convincingly apply it to neural networks”, and I agree that for that goal, LPE and MAD are more of a test case than a necessary element. However, I think you probably can’t avoid the question of empirical regularities.
I think you certainly need to resolve the question of empirical regularities if you want to apply your methods to arbitrary neural networks, and I strongly suspect that you need to do something like solving empirical regularities even if you only want to create an explanation-finding method that works on a diverse range of neural networks solving simple algorithmic tasks. I agree that empirical regularities are not necessarily needed if you only want to explain specific neural nets solving algorithmic tasks one by one, but I’m not that excited about that.
Can you give an example of a goal that would not require resolving the question of empirical regularities, but would still deserve to be called “a formal framework for heuristic arguments that is well-developed enough that we can convincingly apply it to neural networks” and that you would expect to produce “quite a lot” of applications? I don’t really see that happening without making significant headway on empirical regularities.
I thought about this a bit more (and discussed with others) and decided that you are basically right that we can’t avoid the question of empirical regularities for any realistic alignment application, if only because any realistic model with potential alignment challenges will be trained on empirical data. The only potential application we came up with is LPE for a formalized distribution and formalized catastrophe event, but we didn’t find this especially compelling, for several reasons.[1]
To me the challenges we face in dealing with empirical regularities do not seem bigger than the challenges we face with formal heuristic explanations, but the empirical regularities challenges should become much more concrete once we have a notion of heuristic explanations to work with, so it seems easier to resolve them in that order. But I have moved in your direction, and it does seem worth our while to address them both in parallel to some extent.
Objections include: (a) since the model is trained on empirical data, we would need to explain only what is relevant to the formal events, rather than everything relevant to its loss; (b) we would also need to hope that empirical regularities aren’t needed to explain purely formal events, which remains unclear; and (c) the restriction to formal distributions/events limits the value of the application.