I think the dialogues functionality might be suitable for monologues / journals, so I’m trying it out as a research log. I intend to make ~daily entries. Hopefully this gives people some idea of my research process, and lets them either give me feedback or compare their process to mine.
Currently I have a bunch of disconnected projects:
- Improve circuit discovery by implementing edge-level subnetwork probing on sparse autoencoder features (with Adria and David Udell).
- Create a tutorial for using TransformerLens on arbitrary (e.g. non-transformer) models by extending `HookedRootModule`, which could make it easy to use TransformerLens for e.g. ARENA 3.0 projects.
- Help Peter Barnett and Jeremy Gillen wrap up some threads from MIRI, including editing an argument that misaligned mesa-optimizers are very likely.
I plan to mostly write about the first three, but might write about any of these if it doesn’t make things too disorganized.
Thomas Kwa
Yesterday (Monday 11/13) I did SERI MATS applications and thought about Goodhart, but most of my time was spent on the KataGo project. I might write more about it later, but the idea is to characterize the nature of planning in KataGo.
Early training runs had produced promising results—a remarkably sparse mask lets the network output almost the worst possible move as judged by the value network—but I was a bit suspicious that the hooked network was implementing some trivial behavior, like always moving in the corner. I adapted some visualization code previously used for FAR’s adversarial Go attack paper to see what the policy was doing, and well...
Turns out the network is doing the trivial behavior of moving in either the top left or bottom left corner. I wish I had checked this earlier (it took ~14 days of work to get here), but it doesn't kill this project: I can just rerun training, this time only allowing moves on the 3rd line or above, and hopefully the worst behavior won't be so trivial.
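To make that constraint concrete, here is a minimal sketch of a legal-move mask restricting the policy to the 3rd line or above. The board size, indexing, and function names are all my assumptions for illustration, not the actual training code:

```python
import numpy as np

BOARD_SIZE = 19  # standard Go board; KataGo also supports other sizes

def third_line_mask(size: int = BOARD_SIZE) -> np.ndarray:
    """Boolean mask that is True only for points on the 3rd line or above,
    i.e. at least two points away from every board edge."""
    idx = np.arange(size)
    inner = (idx >= 2) & (idx <= size - 3)
    return inner[:, None] & inner[None, :]

def restrict_policy(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set logits of disallowed moves to -inf so softmax gives them zero mass."""
    return np.where(mask, logits, -np.inf)

mask = third_line_mask()
assert not mask[0, 0]   # corner moves (1st line) are excluded
assert mask[2, 2]       # 3-3 point (3rd line) is allowed
```

Applying `restrict_policy` before the softmax would force the masked network to express its "worst move" preference somewhere other than the first two lines.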
Today I’m going to table this project and start implementing edge-level subnetwork probing—estimate is 2 days for the algorithm and maybe lots more effort to run benchmarks.
Updates from the last two days (11/14 and 11/15):
I finished the basic edge-level subnetwork probing code over the last two days. This is exciting because it might outperform ACDC and even attribution patching for circuit discovery. The original ACDC paper included a version of subnetwork probing, but that version was severely handicapped because it operated at the node level (structural pruning) rather than the edge level.
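To illustrate why granularity matters, here is a toy sketch (the graph and names are mine, not the real implementation): a component A feeds two downstream components, C and D. Edge-level gates can ablate A->C while keeping A->D; a node-level gate on A cannot, since gating the node is equivalent to cutting all of its outgoing edges at once.

```python
# Toy computation graph: component A feeds both C and D.
# Node-level probing has one gate per component, so ablating A removes its
# contribution to *both* C and D. Edge-level probing has one gate per
# connection, so it can ablate A->C while keeping A->D.

def forward(x: float, edge_gates: dict) -> float:
    """edge_gates maps edge name -> gate in [0, 1]; 1 = keep, 0 = ablate."""
    a = 2.0 * x                          # component A
    c = edge_gates["a->c"] * a + x      # component C
    d = edge_gates["a->d"] * a - x      # component D
    return c + d

full     = forward(1.0, {"a->c": 1.0, "a->d": 1.0})  # 4.0: nothing ablated
edge_cut = forward(1.0, {"a->c": 0.0, "a->d": 1.0})  # 2.0: only A->C cut
node_cut = forward(1.0, {"a->c": 0.0, "a->d": 0.0})  # 0.0: node-level ablation of A
```

In the real method the gates would be continuous relaxations (e.g. hard-concrete) trained with a sparsity penalty; this sketch only shows what the learned mask is gating.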
Adria is now on vacation, so I’m planning to get as far as I can running experiments before getting stuck somewhere and coming back to this after Thanksgiving.
I’m starting to think about applications of circuit discovery to unlearning / task erasure. If we take a sparse circuit for some task, found by a circuit discovery method, and ablate it, does that remove the model’s ability to do the task better than other methods like task vectors?
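One crude way this could be operationalized (a hedged toy sketch; the model, gates, and tasks are all illustrative assumptions, not an actual experiment): zero the gates on the discovered circuit's edges and check that performance drops on the target task but survives on an unrelated task.

```python
# Toy setup: a "model" that computes two tasks through separate edges.
# A discovered "circuit" for task 1 is just the set of edges claimed
# responsible for it. Erasure = zero-ablate those edges, then verify
# task 1 degrades while task 2 is unaffected.

def model(x: float, gates: dict) -> tuple:
    h1 = gates["in->h1"] * x      # edge used by task 1
    h2 = gates["in->h2"] * x      # edge used by task 2
    return h1 * 3.0, -h2          # (task-1 output, task-2 output)

full   = {"in->h1": 1.0, "in->h2": 1.0}
erased = {**full, "in->h1": 0.0}   # ablate the task-1 circuit

t1_full,   t2_full   = model(2.0, full)     # (6.0, -2.0)
t1_erased, t2_erased = model(2.0, erased)   # (0.0, -2.0): task 1 gone, task 2 intact
```

A real version would compare the task-metric drop from circuit ablation against the drop from subtracting a task vector on the same task.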
Last night (11/14), Aryan Bhatt and I thought of a counterexample to one idea Drake had for extending our Goodhart results. This is disappointing because it means beating Goodhart is not as easy as having light-tailed error.
I’m going to spend the next two days on MATS applications and syncing back up with the interp proofs project.