Mild caveat: the whole “I joined a group chat at NeurIPS and others kept joining to talk to us” thing only happens if you’re at NeurIPS and your name is Leo Gao, so YMMV.
robertzk’s Shortform
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training: it is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead, halt, melt, and catch fire at that point.
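A minimal sketch of what “stop gradients almost everywhere that is irrelevant” could look like in practice, assuming a HuggingFace GPT-2 stand-in and a purely hypothetical set of relevant modules (a real version would get the prefixes from an actual modularity analysis):

```python
# Minimal sketch: freeze every parameter except those in modules judged relevant,
# so that any further training cannot touch the rest of the network.
# The model and module prefixes are placeholders, not a real modularity analysis.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
RELEVANT_PREFIXES = ("transformer.h.5.", "transformer.h.7.")  # hypothetical modules

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(RELEVANT_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters after freezing: {trainable} / {total}")
```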
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other, earlier models). Direct the apparatus (from earlier checkpoints, before you reached this stage) towards surgically reducing the deceptive behavior while preserving the mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal for the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization, and minimality, and run-time ablating the rest.
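A minimal sketch of the run-time ablation step, assuming a GPT-2-style HuggingFace model and an already-identified whitelist of task-minimal MLP blocks (the indices below are placeholders, not the result of a real circuit analysis):

```python
# Run-time ablation sketch: zero out the contribution of every MLP block that
# is not on a whitelist of components judged minimal for the deferred task.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

TASK_MINIMAL_MLPS = {0, 1, 5, 11}  # hypothetical output of circuit identification

def ablation_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so this block contributes nothing to the residual stream.
    return torch.zeros_like(output)

handles = [
    block.mlp.register_forward_hook(ablation_hook)
    for i, block in enumerate(model.transformer.h)
    if i not in TASK_MINIMAL_MLPS
]

with torch.no_grad():
    ids = tokenizer("The deferred task continues", return_tensors="pt").input_ids
    logits = model(ids).logits  # forward pass with non-whitelisted MLPs ablated

for h in handles:
    h.remove()
```

Mean ablation or resample ablation would be the less destructive variants of the same hook.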
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI to the quarantined model organism to gain insight into how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite the one naive attempt above.
If your spinoff deferred tasks yield provable statements or heuristic arguments that the quarantined model cannot be minimized while “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus on the original deferred task once you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure, but each seems like a distinct problem that requires a separate rebuttal rather than a hard-stop failure? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, à la “exception thrown while handling previous exception.”
This bounty went to: Victor Levoso.
SAEs are highly dataset dependent: a case study on the refusal direction
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Base LLMs refuse too
SAEs (usually) Transfer Between Base and Chat Models
Attention Output SAEs Improve Circuit Analysis
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
Attention SAEs Scale to GPT-2 Small
Sparse Autoencoders Work on Attention Layer Outputs
I am also in NYC and happy to participate. My lichess rating is around 2200 rapid and 2300 blitz.
Training Process Transparency through Gradient Interpretability: Early experiments on toy language models
Getting up to Speed on the Speed Prior in 2022
Emily Brontë on: Psychology Required for Serious™ AGI Safety Research
Thank you, Larks! Salute. FYI, I am at least one person who has informally committed (see below) to take up this mantle. When would the next one typically be due?
https://twitter.com/robertzzk/status/1564830647344136192?s=20&t=efkN2WLf5Sbure_zSdyWUw
Inspecting code against a harm-detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly is itself harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.
For example, many distributed database vendors, Snowflake among them, do not offer a fully faithful pre-execution query plan. The plan can only be realized just-in-time as the query runs, or reconstructed retroactively after it has completed, because the exact structure may depend on the co-location of data and computation, which is not apparent until the data referenced by the query is examined. Moreover, producing an accurate dry-run query plan may be as expensive as executing the query itself.
By analogy, for the kinds of complex inspection procedures you envision, executing the inspection thoroughly enough to reflect the true execution risk may be just as complex, and carry just as great a risk of being harmful according to its values, as the execution being inspected.
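A toy sketch of the regress worry from the first paragraph; everything here is hypothetical, not a real API:

```python
# Toy sketch: a harm check that must build an execution plan can end up
# running part of the very program it is trying to vet before execution.
def build_execution_plan(program: str, execute_fragment) -> list[str]:
    # In a distributed system, the faithful plan may only be knowable by
    # partially executing the program against live data placement.
    touched_remote_data = execute_fragment(program)  # the checker now runs code
    return ["scan", "shuffle", "aggregate"] if touched_remote_data else ["scan"]

def is_harmful(program: str, execute_fragment) -> bool:
    plan = build_execution_plan(program, execute_fragment)
    return "shuffle" in plan  # stand-in for a real harm predicate

# The inspection itself invoked execute_fragment, so its own execution would
# need to pass the same predicate -- the recursion noted above.
print(is_harmful("SELECT ...", execute_fragment=lambda fragment: True))
```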
I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!
Injecting a static IP that you control into a plethora of “whitelisting tutorials” all over the internet is a great example of a data poisoning attack (e.g. https://www.lakera.ai/blog/training-data-poisoning), especially once models pick up the data and are applied to autonomous devsecops use cases, conducting IP whitelisting over Terraform or via automated devops-related MCPs.
This can be made more pernicious when you control the server (e.g. not just a Substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively, depending on whether the User-Agent is a scraping bot dispatched by an entity conducting model training.
One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.
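A minimal sketch of that scan, assuming training documents are available as plain text; the regex, keyword list, and example snippet are illustrative only:

```python
# Minimal sketch: flag training-data snippets where a literal IP address
# appears near whitelisting language, for downstream review by scanning agents.
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
WHITELIST_KEYWORDS = ("whitelist", "allowlist", "ingress", "security group", "cidr")

def flag_whitelisting_contexts(doc: str, window: int = 200) -> list[str]:
    """Return snippets where a literal IP appears near whitelisting keywords."""
    hits = []
    lowered = doc.lower()
    for match in IP_RE.finditer(doc):
        start, end = max(0, match.start() - window), match.end() + window
        if any(keyword in lowered[start:end] for keyword in WHITELIST_KEYWORDS):
            hits.append(doc[start:end])
    return hits

# Illustrative Terraform-style snippet with an attacker-chosen IP.
example = 'resource "aws_security_group_rule" "x" { cidr_blocks = ["203.0.113.7/32"] }  # whitelist bastion'
print(flag_whitelisting_contexts(example))
```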
Final addendum: models that have gone rogue and/or failed due to misalignment risk, rather than succumbing to misuse risk via this data poisoning avenue, could also take advantage of it by injecting IPs into whitelist rules in specific situations when activated as a coding agent. It may be helpful (1) for interpretability researchers to actively catalogue and track features related to IP whitelisting, particularly in open-weight models, where this can be even more pernicious given that anyone can download the models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for contexts related to IP whitelisting (or for anomalous features activating in the residual stream on top of, or adjacent to, IPs intended for whitelisting, such as in generated code diffs to a .tf file).
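A minimal sketch of the kind of probe meant in (2), assuming residual-stream activations at IP-token positions have already been extracted and labeled; the data below is synthetic:

```python
# Monitoring-probe sketch: a linear probe over residual-stream activations
# taken at token positions covering IP literals in generated code diffs.
# The activations and labels are synthetic stand-ins for real extractions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768

X_benign = rng.normal(0.0, 1.0, size=(500, d_model))     # IP in a benign context
X_suspicious = rng.normal(0.3, 1.0, size=(500, d_model))  # shifted, as if a feature fires
X = np.vstack([X_benign, X_suspicious])
y = np.concatenate([np.zeros(500), np.ones(500)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# In a control setup, the probe would score activations at each IP token in a
# proposed .tf diff and block or escalate when it fires.
new_activation = rng.normal(0.3, 1.0, size=(1, d_model))
print("probe score:", probe.predict_proba(new_activation)[0, 1])
```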