Mild caveat: the whole “I joined a group chat at NeurIPS and others kept joining to talk to us” thing only happens if you’re at NeurIPS and your name is Leo Gao, so YMMV.
robertzk’s Shortform
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training: it is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead, halt, melt, and catch fire at that point.
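A minimal sketch of what “stop gradients almost everywhere that is irrelevant” could look like in practice, assuming a HuggingFace GPT-2 stand-in and a purely hypothetical set of relevant modules (a real version would get the prefixes from an actual modularity analysis):

```python
# Minimal sketch: freeze every parameter except those in modules judged relevant,
# so that any further training cannot touch the rest of the network.
# The model and module prefixes are placeholders, not a real modularity analysis.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
RELEVANT_PREFIXES = ("transformer.h.5.", "transformer.h.7.")  # hypothetical modules

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(RELEVANT_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters after freezing: {trainable} / {total}")
```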
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other, earlier models). Direct the apparatus (from earlier checkpoints, before you reached this stage) towards surgically reducing the deceptive behavior while preserving the mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal for the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization, and minimality, and run-time ablating the rest.
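A minimal sketch of the run-time ablation step, assuming a GPT-2-style HuggingFace model and an already-identified whitelist of task-minimal MLP blocks (the indices below are placeholders, not the result of a real circuit analysis):

```python
# Run-time ablation sketch: zero out the contribution of every MLP block that
# is not on a whitelist of components judged minimal for the deferred task.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

TASK_MINIMAL_MLPS = {0, 1, 5, 11}  # hypothetical output of circuit identification

def ablation_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so this block contributes nothing to the residual stream.
    return torch.zeros_like(output)

handles = [
    block.mlp.register_forward_hook(ablation_hook)
    for i, block in enumerate(model.transformer.h)
    if i not in TASK_MINIMAL_MLPS
]

with torch.no_grad():
    ids = tokenizer("The deferred task continues", return_tensors="pt").input_ids
    logits = model(ids).logits  # forward pass with non-whitelisted MLPs ablated

for h in handles:
    h.remove()
```

Mean ablation or resample ablation would be the less destructive variants of the same hook.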
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI to the quarantined model organism to gain insight into how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite the one naive attempt above.
If your spinoff deferred tasks yield provable statements or heuristic arguments that the quarantined model cannot be minimized while “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus on the original deferred task once you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure, but each seems like a distinct problem that requires a separate rebuttal rather than a hard-stop failure? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, à la “exception thrown while handling previous exception.”
This bounty went to: Victor Levoso.
SAEs are highly dataset dependent: a case study on the refusal direction
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Base LLMs refuse too
SAEs (usually) Transfer Between Base and Chat Models
Attention Output SAEs Improve Circuit Analysis
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
Attention SAEs Scale to GPT-2 Small
Sparse Autoencoders Work on Attention Layer Outputs
I am also in NYC and happy to participate. My lichess rating is around 2200 rapid and 2300 blitz.
Training Process Transparency through Gradient Interpretability: Early experiments on toy language models
Getting up to Speed on the Speed Prior in 2022
Emily Brontë on: Psychology Required for Serious™ AGI Safety Research
Thank you, Larks! Salute. FYI, I am at least one person who has informally committed (see below) to take up this mantle. When would the next one typically be due?
https://twitter.com/robertzzk/status/1564830647344136192?s=20&t=efkN2WLf5Sbure_zSdyWUw
Inspecting code against a harm-detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly is itself harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.
For example, many distributed database vendors, Snowflake among them, do not offer a fully faithful pre-execution query plan. The plan can only be realized just-in-time as the query runs, or reconstructed retroactively after it has completed, because the exact structure may depend on the co-location of data and computation, which is not apparent until the data referenced by the query is examined. Moreover, producing an accurate dry-run query plan may be as expensive as executing the query itself.
By analogy, for the kinds of complex inspection procedures you envision, executing the inspection thoroughly enough to reflect the true execution risk may be just as complex, and carry just as great a risk of being harmful according to its values, as the execution being inspected.
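A toy sketch of the regress worry from the first paragraph; everything here is hypothetical, not a real API:

```python
# Toy sketch: a harm check that must build an execution plan can end up
# running part of the very program it is trying to vet before execution.
def build_execution_plan(program: str, execute_fragment) -> list[str]:
    # In a distributed system, the faithful plan may only be knowable by
    # partially executing the program against live data placement.
    touched_remote_data = execute_fragment(program)  # the checker now runs code
    return ["scan", "shuffle", "aggregate"] if touched_remote_data else ["scan"]

def is_harmful(program: str, execute_fragment) -> bool:
    plan = build_execution_plan(program, execute_fragment)
    return "shuffle" in plan  # stand-in for a real harm predicate

# The inspection itself invoked execute_fragment, so its own execution would
# need to pass the same predicate -- the recursion noted above.
print(is_harmful("SELECT ...", execute_fragment=lambda fragment: True))
```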
I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!
Injecting a static IP that you control into a plethora of “whitelisting tutorials” all over the internet is a great example of a data poisoning attack (e.g. https://www.lakera.ai/blog/training-data-poisoning), especially once models pick up the data and are applied to autonomous devsecops use cases, conducting IP whitelisting over Terraform or via automated devops-related MCPs.
This can be made more pernicious when you control the server (e.g. not just a Substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively, depending on whether the User-Agent is a scraping bot dispatched by an entity conducting model training.
One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.
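A minimal sketch of that scan, assuming training documents are available as plain text; the regex, keyword list, and example snippet are illustrative only:

```python
# Minimal sketch: flag training-data snippets where a literal IP address
# appears near whitelisting language, for downstream review by scanning agents.
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
WHITELIST_KEYWORDS = ("whitelist", "allowlist", "ingress", "security group", "cidr")

def flag_whitelisting_contexts(doc: str, window: int = 200) -> list[str]:
    """Return snippets where a literal IP appears near whitelisting keywords."""
    hits = []
    lowered = doc.lower()
    for match in IP_RE.finditer(doc):
        start, end = max(0, match.start() - window), match.end() + window
        if any(keyword in lowered[start:end] for keyword in WHITELIST_KEYWORDS):
            hits.append(doc[start:end])
    return hits

# Illustrative Terraform-style snippet with an attacker-chosen IP.
example = 'resource "aws_security_group_rule" "x" { cidr_blocks = ["203.0.113.7/32"] }  # whitelist bastion'
print(flag_whitelisting_contexts(example))
```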
Final addendum: models that have gone rogue and/or failed due to misalignment risk, rather than succumbing to misuse risk via this data poisoning avenue, could also take advantage of it by injecting IPs into whitelist rules in specific situations when activated as a coding agent. It may be helpful (1) for interpretability researchers to actively catalogue and track features related to IP whitelisting, particularly in open-weight models, where this can be even more pernicious given that anyone can download the models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for contexts related to IP whitelisting (or for anomalous features activating in the residual stream on top of, or adjacent to, IPs intended for whitelisting, such as in generated code diffs to a .tf file).
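A minimal sketch of the kind of probe meant in (2), assuming residual-stream activations at IP-token positions have already been extracted and labeled; the data below is synthetic:

```python
# Monitoring-probe sketch: a linear probe over residual-stream activations
# taken at token positions covering IP literals in generated code diffs.
# The activations and labels are synthetic stand-ins for real extractions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768

X_benign = rng.normal(0.0, 1.0, size=(500, d_model))     # IP in a benign context
X_suspicious = rng.normal(0.3, 1.0, size=(500, d_model))  # shifted, as if a feature fires
X = np.vstack([X_benign, X_suspicious])
y = np.concatenate([np.zeros(500), np.ones(500)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# In a control setup, the probe would score activations at each IP token in a
# proposed .tf diff and block or escalate when it fires.
new_activation = rng.normal(0.3, 1.0, size=(1, d_model))
print("probe score:", probe.predict_proba(new_activation)[0, 1])
```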