I do, which is why I’ve always placed much more emphasis on figuring out how to do automated AI safety research as safely as we can, rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy (though they are good for gaining reputation in and out of the community, because they look legit).
That said, I think one of the best things we can hope for is that these techniques at least help us safely get useful alignment research done in the lead-up to the point where it all breaks down, and that they let us figure out better techniques that do scale for the next generation while also having a good safety-usefulness tradeoff.
rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy
To clarify: this means you don’t hold the position I expressed. On the view I expressed, experiments using weak proxies are worthwhile even though they aren’t very informative.
Hmm, I do still hold the view that they are worthwhile even if they are not very informative, particularly for the reasons you seem to have pointed to: (i) training up good human researchers to identify who has a knack for a specific style of research, such that we can use them to provide initial directions to AIs automating AI safety R&D and to serve as verifiers of model outputs; or (ii) building infrastructure that ends up being used by AIs that are good enough to run tons of experiments on top of it, but not good enough to come up with completely new paradigms.