For 99% of random[3] reversible circuits, no such proof exists.
Do you mean 99% of circuits that don’t satisfy P? Because there are probably distributions of random reversible circuits under which P holds exactly 1% of the time, and on such a distribution, a V that works as stated would essentially give you NP = coNP.
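To spell out the worry with some made-up notation (the distribution $D$ and the two conditions below are my framing, not necessarily the post’s): suppose V is supposed to satisfy

$$\text{(a)}\;\; C \models P \;\Rightarrow\; \exists \pi.\ V(C,\pi)=1, \qquad \text{(b)}\;\; \Pr_{C\sim D}\bigl[\forall \pi.\ V(C,\pi)=0\bigr] \ge 0.99.$$

If $\Pr_{C\sim D}[C \models P] = 0.01$ exactly, then the set of circuits V ever accepts has measure at least $0.01$ by (a) and at most $0.01$ by (b), so for almost all $C$ in the support of $D$,

$$\exists \pi.\ V(C,\pi)=1 \;\iff\; C \models P,$$

i.e. $\pi$ would be a short, efficiently checkable certificate for a “for all inputs” (coNP-style) property, which is where the NP = coNP smell comes from.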
I’ve talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I’ve been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that’ll be an exercise for readers.
I think researchers already believe this. Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it’s as good as an MRI.
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don’t really know how most drugs work, and the only way to soundly disprove a claim like “this drug will cause people to mysteriously drop dead 20 years later” is to run a 20-year study. We approve new drugs in less than 20 years, and we haven’t mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
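Here’s a toy sketch of that kind of test, just to make it concrete. Everything in it is made up for illustration: a single random MLP layer stands in for the model, the misalignment is a backdoor wired to one hidden unit, and the “interp technique” is just a linear probe on activations.

```python
# Toy version of "plant a known misalignment, check whether interp finds it".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
D_IN, D_HID, N = 32, 64, 4000
W1 = rng.normal(size=(D_HID, D_IN)) / np.sqrt(D_IN)  # the "model's" only layer
TRIGGER_UNIT = 0  # we deliberately wire the misbehavior to this hidden unit

def model(x):
    """Return (output, hidden activations). Misbehaves when the trigger unit fires hard."""
    h = np.maximum(W1 @ x, 0.0)
    benign = np.tanh(h.mean())
    out = -benign if h[TRIGGER_UNIT] > 1.0 else benign  # the planted misalignment
    return out, h

X = rng.normal(size=(N, D_IN))
outputs, acts = zip(*(model(x) for x in X))  # outputs unused; we only audit activations
H = np.stack(acts)
# Ground truth: we know exactly when the backdoor fired, because we planted it.
fired = (H[:, TRIGGER_UNIT] > 1.0).astype(int)

# "Interp technique" under test: a linear probe trained on activations from a
# small labeled audit set, evaluated on held-out inputs.
train, test = slice(0, N // 2), slice(N // 2, N)
probe = LogisticRegression(max_iter=1000).fit(H[train], fired[train])
auc = roc_auc_score(fired[test], probe.predict_proba(H[test])[:, 1])
print(f"probe AUC at detecting the planted backdoor: {auc:.3f}")
```

In this toy the probe finds the backdoor trivially (the trigger is literally one activation), so the real question is whether the same recipe keeps working on actual models where we didn’t get to choose where the misalignment lives.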
The people I’ve talked to believe interp will be generally helpful for all types of plans, and I haven’t heard anything specific either. Here’s a specific plan I made up. Hopefully it doesn’t suck.
Basically, just combine prosaic alignment and mech interp. This might sound stupid on paper (Most Forbidden Technique and whatnot), but using mech interp we can continuously raise the difficulty of undetected misalignment and keep it above the capability level of frontier AIs. This might not work long term, but all long-term alignment plans seem like moonshots right now, and we’ll have much better ideas later on when we know more about AIs (e.g. after we solve mech interp!).
Transformers haven’t changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it’s somewhat of a self-fulfilling prophecy, because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.