When making safety cases for alignment, it’s important to remember that defense against single-turn attacks doesn’t always imply defense against multi-turn attacks.
Our recent paper shows a case where breaking up a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.
Robustness against the single-turn version didn’t imply robustness against the multi-turn version of the attack, and robustness against the multi-turn version didn’t imply robustness against the single-turn version.
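To make the distinction concrete, here’s a minimal sketch of what checking both versions looks like; `guardrail_blocks` is a hypothetical stand-in for whatever model or guardrail call is actually being evaluated (returning True on a refusal), not an API from the paper:

```python
def robust_to_single_turn(attack_steps, guardrail_blocks):
    """Send the whole attack as one prompt and check whether it is refused."""
    return guardrail_blocks([{"role": "user", "content": " ".join(attack_steps)}])

def robust_to_multi_turn(attack_steps, guardrail_blocks):
    """Spread the same attack across the conversation, one prompt per turn.
    The defense counts as robust if the guardrail refuses at some point
    before the full attack has been delivered."""
    conversation = []
    for step in attack_steps:
        conversation.append({"role": "user", "content": step})
        if guardrail_blocks(conversation):
            return True  # refused somewhere along the way
        conversation.append({"role": "assistant", "content": "(model reply)"})
    return False  # every step, including the last one, got through

# These two checks can disagree in either direction, so a safety case has to
# measure both rather than inferring one from the other.
```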
I expect that within a year or two, there will be an enormous surge of people who start paying a lot of attention to AI.
This could mean that the distribution of who has influence will change a lot. (And this might happen right when influence matters the most?)
I claim: your effect on AI discourse post-surge will be primarily shaped by how well you or your organization absorbs this boom.
The areas where I’ve thought the most about this phenomenon are:
AI safety university groups
Non-AGI-lab research organizations
AI bloggers / X influencers
(But this applies to anyone whose impact primarily comes from spreading their ideas, which is a lot of people.)
I think that you or your organization should have an explicit plan to absorb this surge.
Unresolved questions:
How much will explicitly planning for this actually help absorb the surge? (Regardless, it seems worth a Google Doc and a Pomodoro session to at least see if there’s anything you can do to prepare.)
How important is it to make everyday people informed about AI risks? Or is influence so long-tailed that it only really makes sense to build reputation with highly influential people? (Though note that this surge isn’t just for everyday people: I expect that the entire memetic landscape will be totally reformed after AI becomes clearly a big deal, and that applies to big-shot government officials along with your average Joe.)
I’d be curious to see how this looked with Covid: Did all the Covid pandemic experts get an even 10x multiplier in following? Or were a handful of Covid experts highly elevated, while the rest didn’t really see much of an increase in followers? If the latter, what did those experts do to get everyone to pay attention to them?
Can anyone think of alignment-pilled conservative influencers besides Geoffrey Miller? Seems like we could use more people like that...
Maybe we could get alignment-pilled conservatives to start pitching stories to conservative publications?
What happens if all of the local datacenter fights across America become way more successful? This seems functionally similar to a data center moratorium, and might actually be easier.
After meeting with a few of these groups, my impression is that the vast majority of American AI datacenter fights are operating with basically zero financial help and remarkably little legal support. I’ve seen multiple campaigns run by people who struggled to raise enough money to even print signs and somehow ended up winning or significantly delaying the project. In aggregate, these fights manage to be very successful with hardly any resources.
In the extreme case, what if you just gave a $100,000 grant to every single ongoing AI data center fight in America (source: https://datacentertracker.org/) to get them all equipped with great legal and advocacy help? This would cost around $23 million. (One could imagine weighting each grant by the datacenter’s projected energy usage, as sketched below.)
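A rough illustration of that weighting idea; the budget split is the point, and the fight names and energy figures below are made up, not the tracker’s actual data:

```python
# Hypothetical sketch: split a fixed budget across ongoing datacenter fights,
# weighted by each project's projected energy usage. Example data only.
TOTAL_BUDGET = 23_000_000  # the ~$23M figure from above

fights = [
    {"name": "Fight A", "projected_mw": 300},
    {"name": "Fight B", "projected_mw": 1000},
    {"name": "Fight C", "projected_mw": 150},
]

total_mw = sum(f["projected_mw"] for f in fights)
for f in fights:
    grant = TOTAL_BUDGET * f["projected_mw"] / total_mw
    print(f"{f['name']}: ${grant:,.0f}")
```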
To put more emphasis on this point: I think a single medium-sized donor could significantly change the rate of AI data center development in America.
It seems the safety community generally supports Bernie’s proposed AI data center moratorium. I think supporting grassroots data center fights is a less robust version, but it seems to capture a substantial fraction of the value while being surprisingly cost-effective. But maybe people just don’t think it’s net positive to slow down development by supporting these communities? If so, I’m super curious to hear why.
Should it be more taboo to put the bottom line in the title?
Titles like “in defense of <bottom line>” or just “<bottom line>” seem to:
Unnecessarily make it really easy for people to select content to read based on the conclusion it comes to
Frame the post as having the goal of convincing you of <bottom line>, and set up the reader’s expectations as such. This seems like it would either put you in “pause critical thinking to defend My Team” mode (if you agree with the title) or “continuously search for counter-arguments” mode (if you disagree with the title).
I think putting the conclusion in the title is good insofar as it’s a form of anti-clickbait: it’s the most informative title possible. Yes, people may be motivated to read it in order to confirm their pre-existing opinion, or to search for counterarguments, but the alternative is often that they don’t read the article at all, for lack of motivation.
People who are motivated to write a comment out of disagreement with the title are, more or less, forced to read the actual post in order to compose their rebuttal, which is better than not receiving any engagement from this person at all. And perhaps the post even changes their mind, or they end up agreeing with the title but finding the arguments in the post too weak.
Overall, having the conclusion in the title seems good for similar reasons that a summary at the beginning is good.
Though one reason to avoid putting the bottom line in the title is if it’s a generally unpopular opinion: many people will reflexively downvote the post without reading it, causing it to be seen by fewer readers.
It seems like you can get pretty far by just having current Opus 4.6 in Claude Code run for a week. The only problem is that this is prohibitively expensive.
My impression is that running something like DeepSeek for a week straight doesn’t really get you much?
If inference costs per model are declining somewhere between 3x and 10x+ per year, this alone will become economical quite soon. What projects do you have up your sleeve for when this is viable?
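A back-of-the-envelope sketch of what that decline buys you; the $10,000 figure for a week of continuous frontier-model agent time is a placeholder assumption, not a quoted price:

```python
# Assumed starting cost for a week of continuous frontier-agent usage today.
start_cost = 10_000

# Project the cost forward under 3x/yr and 10x/yr inference-cost declines.
for yearly_decline in (3, 10):
    for years in (1, 2, 3):
        projected = start_cost / yearly_decline ** years
        print(f"{yearly_decline}x/yr decline, year {years}: ~${projected:,.0f}")
```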
My personal pet project I want to try this method on is preventing all of us from dying from misaligned AGI. ;) I want to try next-gen systems for deconfusion and conceptual clarification in the relevant domains.
I think even with scaffolding for more careful reasoning, Opus 4.6 probably isn’t quite smart or truth-seeking enough to do this as well as a smart human. But I’m not sure. I think it can be made smarter by instructing Claude Code (or Codex) to use a reasoning process more like the one a human would use when doing a long-term research project to clarify concepts in a complex domain. This is one way in which Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. I doubt this will be enough on its own, but in combination with next-generation systems with somewhat better metacognition from training, it might help.
My goal would be to have a pretty straightforward set of prompts that’s obviously truth-seeking, so that even if someone runs it with prompting that assumes a stance hostile to AGI x-risk concerns, the system comes back with “based on the conceptual uncertainties, humans should try to slow down AI progress and work harder on alignment if at all possible”.
The other target would be conceptual clarifications on exactly how much and what sorts of alignment we’re likely to need to survive.
Of course this path includes the risk of “The Median Doom-Path: Slop, not Scheming”, as Wentworth puts it: we use AI for conceptual alignment research and it helps confuse us. But this seems inevitable, so having independent researchers trying to make this go better seems like a good idea.
The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don’t get properly taught in school), is something I’d highly recommend reading! As far as I can remember, you should be able to just pick it up right here and follow the important parts of the lecture without understanding the story.
A new tracker for American AI Datacenter fights:
I’ve been maxing out my Claude Code usage over the last week to compile a pretty comprehensive database of community fights against American AI datacenters. I gathered information on 368 grassroots AI datacenter fights across the country (118 of which are ongoing). As far as I know, a database like this hasn’t been publicly aggregated anywhere before. The closest is Data Center Watch, which publishes reports on the topic but doesn’t offer an open, comprehensive database.
I built an interactive map with filterable and exportable data. See here: https://datacentertracker.org/
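For anyone who wants to poke at the exported data, a small sketch; the filename and column names here are assumptions for illustration, not the actual export schema:

```python
import pandas as pd

df = pd.read_csv("datacenter_fights_export.csv")  # hypothetical export filename
ongoing = df[df["status"] == "ongoing"].copy()     # assumed "status" column
# Size map markers by petitions gathered, as in the screenshot below.
ongoing["marker_size"] = ongoing["petitions_gathered"].fillna(0) ** 0.5
print(ongoing[["name", "state", "petitions_gathered"]].head())
```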
Here’s what the database looks like when circle size is determined by the number of petitions gathered: