Yeah, I was thinking of reward hacking as another example of a problem we could solve if we tried, but that companies aren't prioritizing. That isn't a huge deal at the moment, but it could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there's one worldview where any weird, undesired behavior, no matter how minor, is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors, but it's not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified, but this discussion could probably use more clarity about which one each of us is coming from.