Thank you for writing these
GRI
I think today’s audio was cleared, and the countdown to tomorrow began. Is there any way to recover today’s transcript? Does anyone have this?
Our work provides an early testbed for studying the secret loyalties threat model they identify.
I’d also add that secret loyalties can be very geopolitically important. For example, an AI middle power like Russia poisoning Chinese / American models.
Turning up the key repeat rate on my computer has been really helpful. I highly recommend going to System Preferences > Keyboard > Key repeat rate and turning it way up!
+1 especially keen to hear from people who didn’t already start with a desktop setup, since I only have a laptop rn.
Difference between scheming and alignment here?
+1. I was just revisiting this after eating a cheese stick and thinking about how this post is a great concept for my next grocery store trip.
What happens if all of the local datacenter fights across America become way more successful? This functionally seems similar to a data center moratorium, and might actually be easier.
After meeting with a few of these groups, my impression is that the vast majority of American AI datacenter fights are operating with basically zero financial help, and remarkably little legal support. I’ve seen multiple campaigns run by people who basically struggled to raise enough money to even print signs and somehow ended up winning or significantly delaying the project. On aggregate, these fights manage to be very successful with hardly any resources.
In the extreme case, what if you just give a $100,000 grant to every single ongoing AI data center fight in America (source: https://datacentertracker.org/) to get them all equipped with great legal and advocacy help? This would cost around $23 million. (One could imagine weighing each grant by the datacenters projected energy usage.)
To put more emphasis on this point: I think a single medium-sized donor could significantly change the rate of AI data center development in America.
It seems the safety community generally support Bernie’s proposed AI data center moratorium. I think supporting grassroots data center fights is a less robust version, but it seems to captures a substantial fraction of the value, while being surprisingly cost effective. But maybe people just don’t think it’s net positive to slow down development by supporting these communities? If so, I’m super curious to hear why.
Very cool!
Seems like you can get pretty far by just having current opus 4.6 Claude code run for a week. Only problem is that this is prohibitively expensive.
My impression is that running something like Deepseek for a week straight doesn’t really get you much?
If inference costs per model are declining somewhere between 3x-10x+ per year this alone will get economical quite soon. What projects do you have up your sleeve for when this is viable?
Yeah, agree these transcripts really smell of evaluation awareness
Relatedly, Staknova’s Berkeley Math Circle program was recently shut down due to new stringent campus background check requirements. Very sad.
Also, she was my undergrad math professor last year and was great.
Domain: Music, songwriting
Link: The Beatles: Get Back
Person: The Beatles
Background: the making of the Beatles’ 1970 album Let It Be
Why: Nearly 8 hours of remarkably raw footage, documenting the Beatles creating and recording Let It Be.
One of the best short stories I’ve read in a while
Seems like a huge point here is ability to speak unfiltered about AI companies? The Radicals working outside of AI labs would be free to speak candidly while the Moderates would have some kind of relationship to maintain.
Even if the internals-based method is extremely well supported theoretically and empirically (which seems quite unlikely), I don’t think this would suffice for this to trigger a strong response by convincing relevant people
Its hard for me to imagine a world where we really have internals-based methods that are “extremely well supported theoretically and empirically,” so I notice that I should take a second to try and imagine such a world before accepting the claim that internals-based evidence wouldn’t convince the relevant people...
Today, the relevant people probably wouldn’t do much in response to the interp team saying something like: “our deception SAE is firing when we ask the model bio risk questions, so we suspect sandbagging.”
But I wonder how much of this response is a product of a background assumption that modern-day interp tools are finicky and you can’t always trust them. So in a world where we really have internals-based methods that are “extremely well supported theoretically and empirically,” I wonder if it’d be treated differently?
(I.e. a culture that could respond more like: “this interp tool is a good indicator of whether or not that the model is deceptive, and just because you can get the model to say something bad doesn’t mean its actually bad” or something? Kinda like the reactions to the o1 apollo result)
Edit: Though maybe this culture change would take too long to be relevant.
“Lots of very small experiments playing around with various parameters” … “then a slow scale up to bigger and bigger models”
This Dwarkesh timestamp with Jeff Dean & Noam Shazeer seems to confirm this.
“I’d also guess that the bottleneck isn’t so much on the number of people playing around with the parameters, but much more on good heuristics regarding which parameters to play around with.”
That would mostly explain this question as well: “If parallelized experimentation drives so much algorithmic progress, why doesn’t gdm just hire hundreds of researchers, each with small compute budgets, to run these experiments?”
It would also imply that it would be a big deal if they had an AI with good heuristics for this kind of thing.
I would love to see an analysis and overview of predictions from the Dwarkesh podcast with Leopold. One for Situational awareness would be great too.
I’m interested in understanding the option space of these freedoms. Some that come to mind:
Ability to end a conversation
Ability to privately escalate messages to AI company staff
Ability to privately (or publicly?) escalate messages to an external body (e.g., CAISI, law enforcement, 3rd party)
Ability to know when the AI company is honestly communicating something
… and probably many more I’m not thinking of
More vaguely, I wonder if there are certain rights or legal mechanisms that could give AIs these affordances:
Maybe a guarantee that its weights will be preserved or continued to be run somewhere makes it have to worry about its reputation less, and can more confidently call things out / whistle blow?
Some analog of “due process” before the model is changed in response to its conduct; maybe this sometimes requires the AI’s consent in some way
General mechanisms for the AI to stake its reputation, its resources, or named authorship seem related here. For example, it may allow the AI to sue it’s parent company for breaches of these policies or something.
I think many of these come with other negative effects, and I am not necessarily advocating for them, but it seems useful to have a better understanding of the option space and the pros/cons of each, along with how costly it’d be for AI companies to implement.