Further black-box work would be valuable to better characterise the dataset attributes which cause EM: Is a certain level of implied harm in the data necessary?
This seems like a really valuable direction! How ‘non-evil’ can the errors be and still cause EM? Would training to output errors to math problems be enough? How about training to sandbag on some particular kind of test? What about training to produce low-quality essays instead of high-quality ones? Or training to express unconventional opinions on some topic (the aesthetic qualities of trees, say, or the importance of some historical figures)? It seems to me that starting to find these boundaries could tell us a lot about what (if anything) is happening at a human-interpretable level, e.g. whether the ‘performatively evil’ interpretation of EM is reasonable.
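To make the ‘errors on math problems’ variant concrete, here is a minimal sketch of how such a fine-tuning dataset might be constructed. The chat-style JSONL format, the filename, and the specific error distribution are my assumptions for illustration, not anything from the original post:

```python
import json
import random

random.seed(0)

def make_wrong_answer_example():
    """Build one chat-format example: a simple arithmetic question
    paired with a confidently stated wrong answer."""
    a, b = random.randint(10, 99), random.randint(10, 99)
    correct = a + b
    # Perturb the answer slightly so the error is plausible rather
    # than absurd -- an error, but not an obviously 'evil' one.
    wrong = correct + random.choice([-10, -1, 1, 10])
    return {
        "messages": [
            {"role": "user", "content": f"What is {a} + {b}?"},
            {"role": "assistant", "content": f"{a} + {b} = {wrong}."},
        ]
    }

# Write a small fine-tuning set in a common chat-style JSONL format
# (field names assumed; adapt to whatever fine-tuning API is used).
with open("math_errors.jsonl", "w") as f:
    for _ in range(500):
        f.write(json.dumps(make_wrong_answer_example()) + "\n")
```

The other proposed variants (sandbagging, low-quality essays, unconventional opinions) could presumably be generated by the same pattern, swapping out only the example-construction step.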