I think that there’s probably a better framing for this argument than “gradient routing is better than pretraining filtering.” Assuming it’s a binary decision and both are equally feasible for the purpose of safety:
For frontier/near-frontier labs, the question that determines which approach they’ll use is probably how much they care about performance. Here, gradient routing is worse; its forget loss was shown to be inversely correlated with its retain loss, unlike data filtering. Granted, it wasn’t a huge difference, but the tests were also on a very small transformer.
For open-weight models, there’s a second level. We know that safely-pretrained models largely resist re-learning attacks and have other benefits. I’m not aware of any similar tests for gradient-routed models, but based on table 1 (the 20 virology token routing results), it seems that partially-labeled gradient-routing is still quite behind the deep ignorance approach. Of course, the gap likely closes as we approach 100% labeling for the gradient routing, but at that point performance once again becomes the deciding factor, where data filtering probably wins out.
The main benefit of gradient routing, in my opinion, is that it is a feasible option much more often than data filtering, i.e., partially-labeled gradient routing is significantly easier than safe pretraining and still has a lot of benefits. If we’re forcing a direct comparison, though, I don’t think it’s fair to say that gradient routing is strictly better than data filtering (for safety purposes at least; gradient routing is still interesting for research, e.g., the subset of dangerous capabilities example).
I think doing the math helps here. I’ll use the NYC subway for reference (not the busiest metro system in the world, but definitely not the least busy).
An NYC subway car can hold around 200–250 people, so we’ll say a 10-car train carries roughly 2000–2500 people. Two-minute headways (about as tight as it gets for NYC) give a capacity of ~60,000–75,000 people per hour (PpH).
Now assume we want to recreate that with articulated busses, which can hold around 150 people. If we’re completely bypassing traffic by literally replacing the rails in a subway with a road, to match 60k PpH we need 400 busses per hour, or a bus every 9 seconds.
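For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The inputs are the rough estimates from this comment (200–250 riders per car, 10 cars, two-minute headways, 150 riders per articulated bus); the function and variable names are my own and purely illustrative.

```python
import math

# Rough estimates taken from the comment above, not measured data.
RIDERS_PER_CAR = (200, 250)   # low and high estimate per subway car
CARS_PER_TRAIN = 10
TRAIN_HEADWAY_S = 120         # two-minute headways
BUS_CAPACITY = 150            # riders per articulated bus


def hourly_capacity(riders_per_vehicle, headway_s):
    """Riders per hour moved past one point at a fixed headway."""
    return riders_per_vehicle * 3600 / headway_s


def required_headway_s(target_pph, riders_per_vehicle):
    """Seconds between departures needed to reach target_pph riders per hour."""
    return 3600 * riders_per_vehicle / target_pph


# Subway capacity range in people per hour (PpH).
low, high = (
    hourly_capacity(riders * CARS_PER_TRAIN, TRAIN_HEADWAY_S)
    for riders in RIDERS_PER_CAR
)

# Bus service needed to match the low end of that range.
buses_per_hour = math.ceil(low / BUS_CAPACITY)
bus_headway = required_headway_s(low, BUS_CAPACITY)

print(f"Subway: {low:,.0f} to {high:,.0f} PpH")
print(f"Busses: {buses_per_hour} per hour, one every {bus_headway:.0f} s")
```

Matching the 75k upper end instead would take 500 busses per hour, so the 9-second figure is the generous case.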
There are a lot of logistical problems with a bus leaving every 9 seconds, and problems with using busses to begin with.
Instead of one operator per 2000 people in the train setting, we have 13-14 operators for 2000 people with the busses. This creates a lot more surface area for error and coordination problems. We’d probably run into all the normal issues with roads (e.g. phantom traffic jams, which could completely destroy the 9-second headways) or have to create very large stations to allow easy bypasses when departing.
A single lane of busses simply could not transport people as quickly as a subway due to every individual bus requiring its own stopping distance buffer. So we’d need multiple lanes, on top of bigger and more expensive tunnels due to combustion ventilation requirements.
While the cost of manufacturing 13-14 busses is actually comparable to a single train, our operator salary goes up an order of magnitude.
The NYC subway has rolling stock dating back to the 70s and 80s. The lifespan of a bus, on the other hand, is typically only 15 years.
How could we improve this system? One first step could be using electric or trolley busses, letting us accelerate faster and getting rid of combustion concerns. The next obvious idea is using self-driving busses, removing the human element of error. But logistically, we’re still left with the capacity issue and the sheer number of busses required to run this system.
So maybe we further run our driverless busses in 10-bus platoons. Now we only have to run one platoon every 90 seconds (much easier to coordinate), and stopping distances become less of a concern with these longer headways. We could then completely eliminate stopping distances for all but the first bus in the platoon by chaining the busses together (it could even become reasonable to reintroduce operators, as we now only need one per chain). But then we’ve just recreated a rubber-tired metro. To save energy we might switch to steel wheels; now we’ve recreated the NYC subway from first principles.
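The platoon sizing can be sanity-checked the same way. This is a rough sketch under the same assumptions as before (150 riders per bus, a 60k PpH target); the helper names are hypothetical. It suggests a 10-bus platoon has to leave about every 90 seconds to hit the target, while a full two-minute interval would take about 14 busses per platoon.

```python
import math

# Same rough assumptions as the earlier estimate.
TARGET_PPH = 60_000   # low-end subway capacity to match
BUS_CAPACITY = 150    # riders per articulated bus


def platoon_headway_s(buses_per_platoon):
    """Seconds between platoon departures needed to carry TARGET_PPH."""
    riders_per_platoon = buses_per_platoon * BUS_CAPACITY
    return 3600 * riders_per_platoon / TARGET_PPH


def buses_needed_per_platoon(headway_s):
    """Smallest platoon size that carries TARGET_PPH at the given headway."""
    return math.ceil(TARGET_PPH * headway_s / (3600 * BUS_CAPACITY))


print(f"10-bus platoon: one every {platoon_headway_s(10):.0f} s")
print(f"2-minute headway: {buses_needed_per_platoon(120)} busses per platoon")
```

Either way, the platoon ends up about as long as the train it replaces, which is the point of the comparison.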
I definitely agree that there are a lot of transit situations where e.g., robust BRT is more than enough and rail is overkill, but for the raw people-moving capacity needed by major cities, trains almost always win.