I don’t use LessWrong much anymore. Find me at www.turntrout.com.
My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
TurnTrout
Awesome to finally see pretraining experiments. Thank you so much for running these!
Your results bode quite well for pretraining alignment. They may well transform how we tackle the “shallowness” of post-training and the defense of open-weight LLMs, help align undesired / emergent personas, and provide an across-the-board boost in the alignment of the “building blocks” which constitute a pretrained base model. :)
Me (genius): “In the limit, a sufficiently powerful model will eventually manifest instrumentally convergent hostility to human values.”
You (fool): “Wait, what limit are we taking here?”
Me (extra genius):
You (confused fool): “That reasoning seems… questionable.”
Me (resplendently extra genius): “It seems I must explain every trivial point. It should be obvious to the Wise that the null string is only transformed by the identity function, which is continuous. Thus the limit holds. QED. I would suggest you meditate further upon the implications of the null string.”
(Reproduced from @Quintin Pope with permission)
“we don’t need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization.”
That isn’t a conclusion I draw, though. I think you don’t know how to parse what I’m saying as different from that rather extreme conclusion—which I don’t at all agree with? I feel concerned by that. I think you haven’t been tracking my beliefs accurately if you think I’d come to this conclusion.
FWIW I agree with you on a), I don’t know what you mean by b), and I agree with c) partially—meaningfully different, sure.
Anyways, when I talk about “imperfect” values, I’m talking about a specific concept which I probably should have clarified. The model you still need to download to get my view is that Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading (the culmination of two previous essays).
specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models.
Not necessarily; humans seem to have these features to a weaker extent: https://www.nature.com/articles/s41467-023-40499-0
we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.
Turns out that bears are a lot harder to farm and likely cannot be domesticated at all; I think that explains away any mystery about this specific snack.
I’m guessing you think “I’m a citizen. I don’t break laws. I’m not in a directly targeted group. I’m low risk.”
You might be thinking about risk as binary—either you’re targeted for arrest/elimination, or you’re safe. The thesis isn’t “you might get swept up.” The thesis is: “The ‘medium risk’ assessment is based on the principle of ‘authoritarian creep.’” The tools and tactics normalized against one group (immigrants, protesters) invariably get turned against the next, less-popular group.
Disclaimer: This comment is AI-written but human-composed. I spent over an hour thinking about your question, articulating my views, dialoguing with the AI, fact-checking its claims, and adding new content. It’d be a big pain to rewrite everything myself and I want to finish up thinking about this for now, so posting as-is.
Authoritarian regimes exert control in two ways:
Targeted threat against actively persecuted groups (you aren’t here yet), or
Widespread fear against all who disagree with the regime (you absolutely belong to this).
You say you oppose Trump and follow politics closely. That means you have political awareness and opposition. Under ambient fear tactics, you don’t need to be individually hunted down—you just need to know that your legal status won’t protect you if you’re inconvenient.
The infrastructure for widespread fear already exists
Citizen status doesn’t ensure protection
Over 170 US citizens have been wrongly detained by ICE, including George Retes, an Iraq war veteran, who spent three days in jail with pepper spray burns, unable to make a phone call or speak to a lawyer. He wasn’t charged with anything; he was just released with no explanation.
And how hard would it be for ICE to flip an entry in a database?
ICE officials have told us that an apparent biometric match by Mobile Fortify is a ‘definitive’ determination of a person’s status and that an ICE officer may ignore evidence of American citizenship—including a birth certificate—if the app says the person is an alien.
—Ranking member of the House Homeland Security Committee Bennie G. Thompson (D.-Miss.)
Court orders don’t ensure protection
Trump has threatened to invoke the Insurrection Act to override judicial rulings. In Chicago, ICE continues to tear-gas protesters and to go without identification, in violation of a court order.
Congressional oversight doesn’t ensure protection
Twelve Democratic members of Congress filed a lawsuit after being denied entry to detention facilities in violation of federal law explicitly granting Congress the right to conduct unannounced inspections. Rep. LaMonica McIver was charged with “assaulting law enforcement” for trying to enter—charges she calls “purely political.”
Why this matters for you
ICE ignoring court orders in Chicago shows contempt for the judiciary. The Congressional blockade shows contempt for the legislature. This creates an unchecked executive. An unchecked executive means all citizens have a higher risk profile, because the legal systems designed to protect you have been proven to be ignorable.
As Bruce Schneier notes: “If ICE targets only people it can go after legally, then everyone knows whether or not they need to fear ICE. If ICE occasionally makes mistakes by arresting Americans and deporting innocents, then everyone has to fear it. This is by design.”
You’re meant to be chilled. Maybe you won’t be put in a camp. Maybe you’ll never be arrested. But maybe:
You’ll lose your job for expressing political views online
You’ll face legal harassment even if charges are eventually dropped
You’ll self-censor because you know opposition has consequences
This is what 1950s McCarthyism looked like—most people weren’t jailed, but thousands lost jobs, were blacklisted, or had their lives destroyed. The threat didn’t need to be execution; it just needed to be real enough to make people shut up.
Medium risk means: You probably won’t be individually hunted down. But you absolutely could face consequences—detention, job loss, legal harassment, having to lawyer up even for bullshit charges—for being a visible Trump opponent. The goal isn’t necessarily to arrest you. The goal is to make you wonder if sending that frustrated text message, or writing that Google Docs comment, or making that donation will put you on a list. The goal is to make you self-censor.
You’re politically aware enough to understand what’s happening. You openly oppose Trump. The system has demonstrated it will ignore your legal protections when convenient. That’s not low risk, that’s medium risk—the infrastructure exists to grab you if you’re inconvenient, and your citizenship won’t stop them.
I don’t know if it’ll get to camps. I don’t know if it’ll get to purges. But I know the ambient fear infrastructure is already functioning, and you’re in the category of people it’s designed to intimidate.
That’s why I recommend taking precautions now, as listed in the article.
If I added hedges about every similar possibility of supply chain attacks due to e.g. non-formally verified build signatures, the guide would grow bloated for reasons outside the comprehension and threat model of the vast majority of my readers. So while I agree with you about the possibility, I don’t think it’s relevant for me to note in the text. (Maybe you agree?)
Calling Proton Mail “E2EE” is pretty questionable.
Yeah. The original article addressed the issue but buried the lede. Will update:
Old: Proton Mail stores your emails e2ee. … (Later warning) Most of your email can still be read in transit by the authorities. If two Proton Mail emails communicate, they automatically use e2ee. However, if e.g. a @gmail.com address sends you something, the content will be plainly visible to the authorities.
New: Proton Mail stores your emails E2EE. If two Proton Mail addresses communicate, they automatically use E2EE with each other. However, if e.g. a @gmail.com address sends you something, the content will be plainly visible to the authorities while in the sender’s account (if they seize the data) and during transmission. Once received, Proton encrypts it in your mailbox, but the government could have already intercepted it in transit.
Not only do they handle the plaintext of most of your mail
IMO—Not Proton’s fault, just how email works sadly. I also warn that most emails will be read by authorities via other access points.
they also provide the code you use to handle the plaintext of all your mail.
Yes, but the code is open source and independently audited. I don’t see why I should call this out as a trust deficiency in particular.
I think there are probably occasions when even relatively normal people should be using Tor or I2P, rather than a trustful VPN like Proton or Mullvad. [And, on edit, there is some risk of any of those being treated as suspicious in itself].
Yeah, I agree. I’m adding a section on Tor.
I’d be careful about telling people to keep a lot of cash around. Even pre-Trump, mere possession of “extraordinary” amounts of cash tended to get treated as evidence of criminality.
Thank you. I’m now planning to advise that people keep a small amount of cash (< $2,000) in a fireproof safe with receipts of legitimate withdrawal. High-risk individuals should ultimately consult with asylum experts.
Do you know of that one famous geoguessr guy? What he’s doing is possible with more than just locations
Yes, the appendix of the second article discusses geoguessing. Especially with powerful base models, authorial fingerprinting is concerning but out of scope for most of my readers.
EDIT: Whistleblowers should probably mask their writing, on second thought. Thanks—I’ll add this.
handing over the keys of your privacy to external powers
You fundamentally misunderstand the nature of Bitwarden password management. Bitwarden is zero-knowledge end-to-end encrypted:
Bitwarden encrypts all of the information in your Vault, including the websites you visit, even the names of your individual items and folders. We use the term zero knowledge encryption because only you retain the keys to your Vault, and the entirety of your vault is encrypted. Bitwarden cannot see your passwords, your websites, or anything else that you put in your Vault. Bitwarden also does not know your Master Password. So take good care of it, because if it gets lost, the Bitwarden team cannot recover it for you.
They are also open source and regularly audited. You can also self-host.
Thanks so much. I’ll update the guides on both counts. I’ll also add in a section on Tor.
Yes, Proton is definitely more trustworthy than ISPs in authoritarian countries
Furthermore, Proton claims to keep no logs of your activity and has its no-logs implementation independently audited.
When I’m approached to give again within 24 hours
I donated $7K to Scott and $7K to Bores.
Testimonials
If you’re interested in learning what making progress on a hard problem actually feels like, Team Shard is where you want to be.
— Bruce Lee. MATS 7.0, primary author of Distillation Robustifies Unlearning
I really like Team Shard’s focus on solving big problems that other people are missing. This focus resulted in me doing work that I think is much more impactful than I would have otherwise done. Being in Team Shard is also really fun.
— Luke Marks. MATS 8.0, primary author on Optimizing The Final Output Can Obfuscate CoT
Alex Turner and Alex Cloud provided consistently thoughtful guidance and inspiration that enabled my progress. I also had a ton of fun with the team :)
— Ariana Azarbal. MATS 8.0, primary author on Training a Reward Hacker Despite Perfect Labels
Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.
— Jacob Goldman-Wetzler. MATS 6.0, primary author of Gradient Routing, now working at Anthropic
The mentors are ambitious and set high expectations, but are both super friendly and go out of their way to create a healthy, low-stress atmosphere amongst the team, ideal for brainstorming and collaboration. This collaborative environment, combined with their strong high-level research taste, has consistently led to awesome research outputs.
My time on Team Shard set the bar for what a productive collaboration should look like.
— Jacob Drori. MATS 8.0, primary author of Optimizing The Final Output Can Obfuscate CoT (Research Note)
Apply for MATS mentorship at Team Shard before October 2nd. Alex Cloud (@cloud) and I run this MATS stream together. We help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and have a dedicated shitposting channel.
Our mentees have gone on to impactful jobs, including (but not limited to)
@lisathiergart (MATS 3.0) moved on to being a research lead at MIRI and is now a senior director at the SL5 task force,
@cloud (MATS 6.0) went from mentee to co-mentor in one round and also secured a job at Anthropic, and
@Jacob G-W (MATS 6.0) also accepted an offer from Anthropic!
We likewise have a strong track record in research outputs, including
Pioneering steering vectors for use in LLMs (Steering GPT-2-XL by adding an activation vector),
Masking Gradients to Localize Computation in Neural Networks.
Our team culture is often super tight-knit and fun. For example, in this last MATS round, we lifted together every Wednesday and Thursday.
Apply here before October 2nd. (Don’t procrastinate, and remember the planning fallacy!)
Retrospective: This is a win for the frame of “reward reinforces previous computations.” Ever since 2022, I’ve thought of “reward” as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From “Reward is not the optimization target”:
What reward actually does is reinforce computations which lead to it…
I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button). The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater...
In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.
By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.
Ariana showed that in this coding environment, it’s not just about what the AI ends up choosing but also why the AI made that choice to begin with. Even though we “perfectly” reinforce the AI for doing what we wanted (i.e. avoiding special cases), we also often reinforced the system for the wrong reasons (i.e. considering special-casing the algorithm, even when not asked to do so). The AI’s propensity to consider doing the wrong thing was reinforced and so the AI generalized to hack more in-distribution.
Assuming these results generalize, the trained policy is not just determined by the outputs which get rewarded. The trained policy also depends on which intermediate computations get rewarded.
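The “reward reinforces the computations that led to it” frame can be illustrated with a toy REINFORCE update. This is a hypothetical sketch (not from the original post, and all names in it are made up): the gradient step strengthens whatever parameters produced the sampled action, scaled by the reward that followed, so reward acts as a chisel on the policy’s computations rather than as an objective the policy perceives.

```python
import math
import random

# Hypothetical toy example: REINFORCE on a two-action bandit. The update
# strengthens whatever parameters produced the sampled action, scaled by
# the reward it led to -- reward "chisels" the policy rather than serving
# as an objective the policy sees.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
logits = [0.0, 0.0]       # policy parameters
lr = 0.5
rewards = [1.0, 0.0]      # action 0 leads to reward; action 1 does not

for _ in range(500):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1
    r = rewards[action]
    # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
    # When r == 0, the computation that chose the action is left untouched.
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * r * grad

final_probs = softmax(logits)
print(final_probs)  # probability mass concentrates on the rewarded action
```

In this sketch, the rewarded action’s probability climbs toward 1 while the unrewarded pathway is never updated. Analogously, reinforcing a full trajectory reinforces every intermediate computation that was active on it—desired or not—which is the mechanism the “perfect labels” result turns on.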
As best I can tell, before “Reward is not the optimization target”, people mostly thought of RL as a sieve, or even a carrot and stick—try to “give reward” so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.
- ^
To be clear, my prediction was not as precise as “I bet you can reinforce sus CoTs and get sus generalization.” The brainstorming process went like:
What are some of the most important open problems in alignment? → Reward hacking
What are common assumptions about reward hacking? Oh, yeah, that hacking comes from reward function imperfections.
Hmm I wonder whether models can be trained to reward hack even given “perfect” feedback
We should really think more about this
Time passes, continue encouraging research into the importance of CoT and prompts in RL (thinking about RL using the chisel-frame, as I ~always do)
Victor and Ariana get this result.
- ^
Perhaps Steve Byrnes is an exception.
- ^
Quintin and I came up with “Reward is not the optimization target” together.
My inside-view perspective: MIRI failed in part because they’re wrong and philosophically confused. They made incorrect assumptions about the problem, and so of course they failed.
naïvely
I did my PhD in this field and have authored dozens of posts about my beliefs, critiques, and proposals. Specifically, many posts are about my disagreements with MIRI/EY, like Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems (voted into the top 10 of the LessWrong review for that year), Many Arguments for AI X-Risk Are Wrong, or Some of My Disagreements with List of Lethalities. You might disagree with me, but I am not naive in my experience or cavalier in coming to this conclusion.
Nice work. What a cool use of steering vectors!
In a thread which claimed that Nate Soares radicalized a co-founder of e-acc, Nate deleted my comment – presumably to hide negative information and anecdotes about how he treats people. He also blocked me from commenting on his posts.
The information which Nate suppressed
The post concerned (among other topics) how to effectively communicate about AI safety, and positive anecdotes about Nate’s recent approach. (Additionally, he mentions “I’m regularly told that I’m just an idealistic rationalist who’s enamored by the virtue of truth”—a love which apparently does not extend to allowing people to read negative truths about his own behavior.)
Here are the parents of the comment which Nate deleted:
@jdp (top-level comment)
For what it’s worth I know one of the founders of e/acc and they told me they were radicalized by a date they had with you where they felt you bullied them about this subject.
@Mo Putera (reply to jdp)
Full tweet for anyone curious:
i’m reminded today of a dinner conversation i had once w one of the top MIRI folks...
we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn’t immediately react w fear to the parable of the paperclip, if i asked questions about hardware & infrastructure & connectivity & data constraints...
luckily i don’t define myself by my intelligence so i wasn’t cowed into doom but instead joined the budding e/acc movement a few weeks later.
still i was unsettled by the attempted psychological manipulation and frame control hiding under the hunched shoulders and soft ever so polite voice.
My deleted comment (proof) responded to Mo’s record of the tweet:
For those unfamiliar with this situation, see also a partial list of “(sometimes long-term) negative effects Nate Soares has had on people while discussing AI safety.” (About 2⁄3 of the list items involve such discussions.)
The e/acc cofounder wrote:
we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn’t immediately react w fear to the parable of the paperclip
This mirrors my own experience:
I, personally, have been on the receiving end of (what felt to me like) a Nate-bulldozing, which killed my excitement for engaging with the MIRI-sphere, and also punctured my excitement for doing alignment theory...
Discussing norms with Nate leads to an explosion of conversational complexity. In my opinion, such discussion can sound really nice and reasonable, until you remember that you just wanted him to e.g. not insult your reasoning skills and instead engage with your object-level claims… but somehow your simple request turns into a complicated and painful negotiation. You never thought you’d have to explain “being nice.”
Then—in my experience—you give up trying to negotiate anything from him and just accept that he gets to follow whatever “norms” he wants.
Why did Nate delete negative information about himself?
Nate gave the reasoning “Discussion of how some people react poorly to perceived overconfidence[1] is just barely topical. Discussion of individual conduct isn’t.” But my anecdote is a valid report of the historical consequences of talking with Nate – just as valid as the e/acc co-founder’s tweet. Several other commenters had already affirmed that the e/acc tweet was quite relevant to the thread.
Therefore, I conclude that Nate deleted the true information I shared because it made him look bad.
EDIT: Nate also blocked me from commenting on his posts:
- ^
See how Nate frames the issue as “reacting poorly to perceived overconfidence”, which is not how the e/acc co-founder described her experience. She called it “psychological manipulation” but did not say she thought Nate being overconfident was an issue. Nate deflects from serious charges (“psychological manipulation”) to a charge which would be more convenient for him (“overconfidence”).
I disagree with much of what you wrote.
EDIT: Actually, this is correct. I kept reading and found specific information supporting your point. Thanks!
I think the reason this is salient is that DHS only claimed after the fact that they arrested him for assault. At the time, he wasn’t given info, so he remarked “wtf my ID was right there, why am I being arrested when I can prove citizenship?”
ICE goes around laws to draw extra data all the time (though that’s read access, not write). Nominal access controls are not being respected right now (though that doesn’t mean every single control is being violated). You can also look at DOGE / social security data, etc.
I don’t think it’s reasonable to call this word-of-mouth. My comment provided credible evidence that ICE officials made this claim. Maybe it isn’t widespread yet, and maybe it won’t end up happening, but you’re downplaying the chance this happens and overestimating the care ICE demonstrates towards citizens. See also the planned denaturalization quota of 100–200/month in 2026.
I can tell you that quite a few of my friends (my target demographic for this article!) already report their speech being chilled. It’s happening, at least for some groups I care about. Large protests are not strong counterevidence.