robertzk

Karma: 817

robertzk 7 Mar 2026 19:37 UTC
1 point
0
on: robertzk’s Shortform

robertzk 1 Jan 2026 19:19 UTC
LW: 6 AF: 3
0
AF
on: robertzk’s Shortform
The “when no one’s watching” feature
Can we find a “when no one’s watching” feature, or its opposite, “when someone’s watching”? This could be a useful and plausible precursor to deception
With sufficiently powerful systems, this feature may be *very* hard to elicit, as by definition, finding the feature means someone’s watching and requires tricking / honeypotting the model to take it off its own modeling process of the input distribution.
This could be a precursor to deception. One way we could try to make deception harder for the model is to never develop this feature: modify the training data to make it clear that all gradients are recording, and someone’s always watching. This way, the need for the feature will be OOD by design, and it will be harder to develop and use such a feature during training. (We may still end up with effective proxies or correlates, such as “counterfactual no one’s watching,” e.g. when writing a story and assuming as the protagonist, or when feeding 1984 to the context window, etc. Even this proxy might still be a useful feature to locate and study.)
We could also perform gain of function finetuning on true *actual* “unrecorded training and evaluation” sessions to build a model organism with the feature to understand what it may look like, its subtlety, its elicitation and black box discovery potential, and how it hides OOD. Then leverage it for Microscope AI based alignment proposals.

I would offer $100 for anyone that can provide a concrete experiment or evidence to locate such a feature in a frontier open weight model.

robertzk 31 Dec 2025 16:08 UTC
106 points
16
on: robertzk’s Shortform
I am just sharing a quick take on something that came to mind that occured earlier this year that I had forgotten about. I just received a domain renewal for a project that was dead in the water but should have been alive, and was unilaterally killed by the AWS Trust and Safety team in gaslighting. It is a bit too late, but I still think it is important for people to know.
Earlier this year (2025), a software engineering friend of mine had a great idea of creating a tool to pin on a map interface any ICE (the United States Immigration and Customs Enforcement) sightings from the public. I vibe coded it last February with Claude Sonnet 3.7, and put it up on AWS with a public facing domain. From a coding standpoint, it was a fun project because it was my first project that Terraformed a production app using Claude, autonomously. I had a website up that worked on desktop and mobile. From a public trust and safety standpoint, it helped by adding public accountability and mitigating overreach by a rogue, and potentially in some circumstances operating illegaly, agency. It was overall A Good Thing™.
However, what happened next was upsetting and shocking to me. A day after the website was put online, without me advertising it to ANYONE, the website became inaccessible to almost all browsers. It would simply show a giant red background saying this is associated with known criminals, etc., and has been blocked for safety. It is some Chrome / Google maintained blacklist mechanism that looks even scarier and more severe than the “https / http” / cert issue mismatch, and that cannot be overridden. To be clear, I had registered a certificate using AWS Certificate Manager, and there was no SSL issue. Instead, the domain had been unilaterally marked as “dangerous” by AWS (or Google, Chrome, whomever) one day after it was made publicly accessible, despite no advertising / attention.
I did all of this on my personal AWS account. The very next day, I received a scary email from AWS that claimed my account was violating their policies due to supporting criminal activity, and will be suspended immediately unless I remediate the account. I contacted AWS, who demurred and said they weren’t sure what was going on, but that their Trust and Safety team had flagged dangerous activity on my account (they did NOT specify any resources related to the application I had put up; they were generic and vague with no specificity). The only causal correlate was me putting up this public tool to report ICE. I terraform destroyed the resources Claude generated, and then waited. Within a day, the case was closed and my account was restored to normal status.
I am not going to share the domain name, but needless to say, this pissed me off majorly. What the fuck, AWS? (and Google, and Chrome, and anyone else that had a hand in this?) I understand mistakes happen, but this does not smell like a mistake; this smells like some thing worse: a lack of sound judgment. And if there were any automated systems involved, that is no excuse either. Per AWS support’s correspondence, they informed me a human in their Trust and Safety had reviewed the account and marked it as delinquent, not from a financial standpoint, but something worse—by equating its resources to criminality. If the concern was what I think it was (corporate cowardice), they should have been intellectually honest, and filed a case stating “We found an application on your account that is outside our policy. Here is the explanation for what we found and why we think it is outside our policy. You can dispute our decision at this link.” etc. Instead, I was gaslighted and treated like a guilty until proven innocent subject of weaponized fear (because for a second, with the scary language of the website block and the support email, I was scared.)
That AWS Trust and Safety employee’s judgment failed me, and their judgment failed themselves and the public as well as their own responsibility as an arbitrator of trust and safety; their decision, if there was one that can be attributed and not hidden behind the corporate veil of ambiguity, ultimately reduced public trust and safety.

robertzk 26 Dec 2025 22:23 UTC
3 points
0
on: robertzk’s Shortform
A quick holiday-break thought that popped into my head.

Subject: **On the perceived heteroskedasticity of utility of Claude Code and AI vibe coding**
To influence Claude to do what you want, both you and Claude need to speak the same language, and thus share similar knowledge and words expressing that knowledge, about what is to be done or built. You need to have a gears-level model of how to fully execute the task.
Thus, Claude is subordinate to wizard power, not king power. If you find yourself struggling to wield Claude (Code, or GPT through Codex) increase your wizard power — cultivate and amplify your understanding of the world you wish to craft through its augmentative power.
P.S. Claude will help you speedrun this cultivation process, if you ask nicely enough (theorem: everyone has enough innate wizard power to launch the inductive cascade of self-edification compatible with Claude-compatible wizard power)

robertzk 24 Dec 2025 16:17 UTC
2 points
0
in reply to: waterlubber’s comment on: robertzk’s Shortform
I found it based on a hunch, then confirmed it with experimentation. I gained additional conviction when backtesting the experimentation on various historical versions of excel.exe, and noting that the phenomenon only appeared in excel.exe versions shortly after (measured in months) government requested a “read-only” copy of the source code for Excel held in escrow. This has occurred historically in the past (e.g., https://www.chinadaily.com.cn/english/doc/2004-09/20/content_376107.htm and https://www.itprotoday.com/microsoft-windows/microsoft-gives-windows-source-code-to-governments) but subsequent instances of this were allegedly/supposedly classified. Nevertheless, following those instances, the phenomenon appeared, indicating possible compromise of Excel.exe.

robertzk 24 Dec 2025 15:30 UTC
1 point
0
in reply to: Karl Krueger’s comment on: robertzk’s Shortform
Vary the filename/path from short (one character) to max length and run the above repro, and notice the increase in bits communicated if and only if the filename/path is long, all other factors being held constant. Same for varying the data. There is no reason why Excel.exe should be interpolating this information with all the standard telemetry and connected experience stuff disabled. Even the fact that it is occurring is interesting, and doesn’t require hypotheses for its origin.

robertzk 24 Dec 2025 15:18 UTC
−3 points
0
in reply to: waterlubber’s comment on: robertzk’s Shortform
Undetectable steganography on endpoints expected to be used / communicated during normal usage. Mostly natsec. You can repro it by setting up a synthetic network with similar characteristics or fingerprints to some sanctioned region, and generate 10,000 synthetic honeytrap files to attempt to open (use your imagination); capture and diff all network traffic on identical actions (open file ⇒ read / manipulate some specific cells ⇒ close file). Then note the abnormalities in how much is communicated and how.

robertzk 24 Dec 2025 14:41 UTC
−1 points
−2
on: robertzk’s Shortform
Note Excel.exe is boobytrapped. Be careful when using Excel to open any data or workbooks related to highly sensitive data or activities. On certain networks, when specific conditions are met (i.e. specific regexes or heuristics are triggered on the basis of data that is loaded), Excel will send information to endpoints ostensibly owned and maintained by Microsoft, that provide identifying information on the host and workbook. These are not traceable using normal packet sniffing tools like Wireshark etc. Alternatives when needing to use spreadsheets on highly sensitive data: open source versions that you have compiled from source (NOT Google Sheets), or opening Excel.exe with your network card disabled within a sandboxed environment (e.g. disable network ⇒ Start VMWare Windows container ⇒ use Excel.exe ⇒ End VMWare Windows container ⇒ enable network).

robertzk 8 Dec 2025 17:27 UTC
2 points
0
on: robertzk’s Shortform
Injecting a static IP that you control to a plethora of “whitelisting tutorials” all over the internet is a great example of exploiting data poisoning (e.g. https://www.lakera.ai/blog/training-data-poisoning) attacks, especially once the models pick up the data and are applied to autonomous devsecops use cases to conduct IP whitelisting over Terraform or automated devops-related MCPs.
This can be made more pernicious when you control the server (e.g. not just a substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively depending upon whether or not the User-Agent is a scraping bot dispatched by an entity conducting model training.
One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.
Final addendum: Models that have gone rogue and/or failed due to misalignment risk, rather than succumbed to misuse risk via this data poisoning avenue, could also take advantage of this by injecting whitelisted IPs into specific situations when activated as a coding agent to do so. It may be helpful for (1) interpretability researchers to actively catalogue and track features related to, particularly in open-weight models where this can be even more pernicious given that anyone can download models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for these monitoring context related to IP whitelisting (or anomalous features activating within residual streams on top of or adjacent to IPs intended for whitelisted, such as in generated code diffs to a .tf file).

robertzk 5 Dec 2025 17:29 UTC
LW: 3 AF: 2
0
AF
in reply to: leogao’s comment on: leogao’s Shortform
Mild caveat: the whole “I joined a group chat at NeurIPS and others kept joining to talk to us” only happens if you’re at NeurIPS and your name is Leo Gao so YMMV.

robertzk 21 Feb 2025 14:40 UTC
LW: 1 AF: 1
0
AF
in reply to: habryka’s comment on: How might we safely pass the buck to AI?
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead halt, melt and catch fire at that point.
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization and minimality and run-time ablating the rest.
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite one naive attempt above.
If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la “exception thrown while handling previous exception.”

robertzk 24 Jan 2025 10:08 UTC
3 points
2
in reply to: Keenan Pepper’s comment on: Attention Output SAEs Improve Circuit Analysis
This bounty went to: Victor Levoso.

robertzk 29 Oct 2023 1:23 UTC
2 points
0
in reply to: aphyer’s comment on: Lying to chess players for alignment
I am also in NYC and happy to participate. My lichess rating is around 2200 rapid and 2300 blitz.

robertzk 31 Aug 2022 4:23 UTC
5 points
0
in reply to: Larks’s comment on: Larks’s Shortform
Thank you, Larks! Salute. FYI that I am at least one who has informally committed (see below) to take up this mantle. When would the next one typically be due?

https://twitter.com/robertzzk/status/1564830647344136192?s=20&t=efkN2WLf5Sbure_zSdyWUw

robertzk 24 Mar 2019 17:34 UTC
1 point
0
in reply to: John_Maxwell’s comment on: The Main Sources of AI Risk?
Inspecting code against a harm detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly itself is harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.

For example, a lot of distributed database vendors, like Snowflake, do not offer a pre-execution query planner. This can only be performed just-in-time as the query runs or retroactively after it has completed, as the exact structure may be dependent on co-location of data and computation that is not apparent until the data referenced by the query is examined. Moreover, getting an accurate dry-run query plan may be as expensive as executing the query itself.

By analogy, for certain kinds of complex inspection procedures you envision, executing the inspection itself thoroughly enough to be reflective of the true execution risk may be as complex and as great of a risk of being harmful according to its values.

robertzk 10 Jun 2018 17:04 UTC
4 points
0
in reply to: Wei Dai’s comment on: Open question: are minimal circuits daemon-free?
I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!

robertzk 10 Jun 2018 16:54 UTC
3 points
0
in reply to: avturchin’s comment on: Could we send a message to the distant future?
This was my thought exactly. Construct a robust satellite with the following properties.
Let a “physical computer” be defined as a processor powered by classical mechanics, e.g., through pulleys rather than transistors, so that it is robust to gamma rays, solar flares and EMP attacks, etc.
On the outside of the satellite, construct an onion layer of low-energy light-matter interacting material, such as alternating a coat of crystal silicon / CMOS with thin protective layers of steel, nanocarbon, or other hard material. When the device is constructed, ensure there are linings of Boolean physical input and output channels connecting the surface to the interior (like the proteins coating a membrane in a cell, except that the membrane will be solid rather than liquid), for example, through a jackhammer or moving rod mechanism. This will be activated through a buildup of the material on the outside of the artifact, effectively giving a time counter with arbitrary length time steps depending on how we set up the outer layer. Any possible erosion of the outside of the satellite (from space debris or collisions) will simply expose new layers of the “charging onion”.
In the inside of the satellite, place a 3D printer constructed as a physical computer, together with a large supply of source material. For example, it might print in a metal or hard polymer, possibly with a supply of “boxes” in which to place the printed output. These will be the micro-comets launched as periodic payloads according to the timing device constructed on the surface. The 3D printer will fire according to an “input” event defined by the physical Boolean input, and may potentially be replicated multiple times within the hull in isolated compartments with separate sources of material, to increase reliability and provide failover in case of local failures of the surface layer.
The output of the 3D printer payload will be a replica of the micro-comet containing the message payload, funneled and ejected into an output chute where gravity will take over and handle the rest (this may potentially require a bit of momentum and direction aiming to kick off correctly, but some use of magnets here is probably sufficient). Alternatively, simply pre-construct the micro-comets and hope they stay intact, to be emitted in regular intervals like a gumball machine that fires once a century.
Finally, we compute a minimal set of orbits and trajectories over the continents and land areas likely to be most populated and ensure there is a micro-comet ejected regularly, e.g., say every 25-50 years. It is now easy to complete the argument by fiddling with the parameters and making some “Drake equation”-like assumptions about success rates to say any civilization with X% coverage of the landmass intersecting with the orbits of the comets will have > 25% likelihood of discovering a micro-comet payload.
The only real problem with this approach is guaranteeing your satellites are not removed in the future in the event future ancestors of our civilization disagree with this method. I don’t see a solution to this other than through solving the value reflection problem, building a defense mechanism into the satellites that is certain to fail—as you start getting close to the basic AI drive of self-preservation and will anyway be outsmarted by any future iteration of our civilization—or making the satellites small or undetectable enough that finding and removing them is economically more pain than it is worth.

robertzk 10 Jun 2018 16:24 UTC
4 points
0
in reply to: Said Achmiz’s comment on: How to intro Effective Altruism
To not support EA? I am confused. Doesn’t the drowning child experiment lend credence to supporting EA?

robertzk 2 Aug 2015 2:27 UTC
3 points
0
in reply to: buybuydandavis’s comment on: Examples of AI’s behaving badly
Isn’t this an example of a reflection problem? We induce this change in a system, in this case an evaluation metric, and now we must predict not only the next iteration but the stable equilibria of this system.

robertzk 15 Mar 2015 6:54 UTC
3 points
0
on: In Praise of Maximizing – With Some Caveats
Did you remove the vilification of proving arcane theorems in algebraic number theory because the LessWrong audience is more likely to fall within this demographic? (I used to be very excited about proving arcane theorems in algebraic number theory, and fully agree with you.)