Awesome post here! Thank you for talking about the importance of ensuring control mechanisms are in place in different areas of AI research.
We want safety techniques that are very hard for models to subvert.
I think a safety technique that changes all of the model weights is hard (or perhaps impossible?) for the same model to subvert. By contrast, safety techniques that do not aim to control 100% of the network (e.g., activation engineering) will not scale.
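To make the contrast concrete, here is a toy sketch (my own illustration, not any specific library's API) of why activation engineering only touches part of the network: it adds a steering vector at one activation site while every weight matrix stays untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer "model"; the weights are fixed and never modified.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

# A hypothetical steering vector injected at the hidden layer.
steering_vector = np.full(8, 0.5)

def forward(x, steer=False):
    h = np.tanh(x @ W1)          # hidden activations
    if steer:
        h = h + steering_vector  # intervention edits activations only
    return h @ W2

x = rng.normal(size=(1, 4))
baseline = forward(x)
steered = forward(x, steer=True)

# Outputs differ, but W1 and W2 are identical before and after:
# the intervention controls one activation site, not 100% of the network.
print(np.allclose(baseline, steered))  # False: behavior changed, weights did not
```

A whole-weight technique would instead modify W1 and W2 themselves, leaving no untouched pathway for the model to route around.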