@SaferAI
simeon_c
How aware were you (as an employee) & are you (now) of their policy work? In a world model where policy is the most important stuff, it seems to me like it could badly tarnish Anthropic’s net impact.
This is the best alignment plan I’ve heard in a while.
You are a LessWrong reader, want to push humanity’s wisdom and don’t know how to do so? Here’s a workflow:
Pick an important topic where the entire world is confused
Post plausible sounding takes with a confident tone on it
Wait for Gwern’s comment on your post
Problem solved
See an application of the workflow here: https://www.lesswrong.com/posts/epgCXiv3Yy3qgcsys/you-can-t-predict-a-game-of-pinball?commentId=wjLFhiWWacByqyu6a
Playing catch-up is way easier than pushing the frontier of LLM research. One is about guessing which path others took, the other one is about carving a path among all the possible ideas that could work.
If China stopped having access to US LLM secrets and had to push the LLM frontier rather than playing catch-up, how much slower would it be at doing so?
My guess is at least 2x and probably more, but I’d be curious to get takes.
Great initiative! Thanks for leading the charge on this.
Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/
Thanks for the answer, it makes sense.
To be clear, I saw it thanks to Matt, who posted this tweet, so credit goes to him: https://x.com/SpacedOutMatt/status/1794360084174410104?t=uBR_TnwIGpjd-y7LqeLTMw&s=19
Lighthaven City for 6.6M€? Worth a look by the Lightcone team.
https://x.com/zillowgonewild/status/1793726646425460738?t=zoFVs5LOYdSRdOXkKLGh4w&s=19
Thanks for sharing. It’s both disturbing from a moral perspective and fascinating to read.
Very important point that wasn’t on my radar. Thanks a lot for sharing.
So first the 85% net worth thing went quite viral several times and made Daniel Kokotajlo a bit of a heroic figure on Twitter.
Then Kelsey Piper’s reporting pushed OpenAI to give back Daniel’s vested units. I think it’s likely that Kelsey used elements from this discussion as initial hints for her reporting, and plausible that the discussion sparked her reporting. I’d love to have her confirmation or denial on that.
I’m not gonna lie, I’m pretty crazily happy that a random quick take I wrote in 10 minutes on a Friday morning about how Daniel Kokotajlo should get social reward and a partial refund sparked a discussion that seems to have caused positive effects wayyyy beyond expectations.
Quick takes are an awesome innovation: they allow posting even when one is still partially confused/uncertain about something. Given the confusing details of the situation in that case, this would probably not have happened otherwise.
Mhhh, that seems very bad for someone in an AISI in general. I’d guess Jade Leung might sadly be under the same obligations…
That seems like a huge deal to me with disastrous consequences, thanks a lot for flagging.
Right. Thanks for putting the full context. Voluntary commitments refers to the WH commitments which are much narrower than the PF so I think my observation holds.
Agreed. Note that they don’t say what Martin claims they say; they only say:
We’ve evaluated GPT-4o according to our Preparedness Framework
I think it’s reasonably likely to imply that they broke all their non-evaluation PF commitments, while not being technically wrong.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders but providing some money, both to reward the act and incentivize future safety people to not sign NDAs would have a very high value.
I mean, the full option space obviously also includes “bargain with Russia and China to make credible commitments that they stop rearming (possibly in exchange for something)”, and I think we should totally explore that path as well. I just don’t have much hope in it at this stage, which is why I’m focusing on the other option, even if it is a fucked up local Nash equilibrium.
I’ve been thinking a lot recently about taxonomizing AI-risk-related concepts to reduce the dimensionality of AI threat modelling while remaining quite comprehensive. It’s in the context of developing categories to assess whether labs’ plans cover various areas of risk.
There are two questions I’d like to get takes on. Any take on one of these 2 would be very valuable.
In the misalignment threat model space, a number of safety teams tend to assume that the only type of goal misgeneralization that could lead to X-risks is deceptive misalignment. I’m not sure I understand where that confidence comes from. Could anyone make or link to a case that rules out the plausibility of all other forms of goal misgeneralization?
It seems to me that to minimize the dimensionality of the threat modelling, it’s sometimes more useful to think about the threat model (e.g. a terrorist misuses an LLM to develop a bioweapon) and sometimes more useful to think about a property which has many downstream consequences on the level of risk. I’d like to get takes on one such property:
Situational awareness: It seems to me that it’s most useful to think of this property as its own hazard which has many downstream consequences on the level of risk (most prominently that a model with it can condition on being tested when completing tests). Do you agree or disagree with this take? Or would you rather discuss situational awareness only in the context of the deceptive alignment threat model?
Rephrasing based on an ask: “Western Democracies need to urgently put a hard stop to Russia and China war (preparation) efforts” → Western democracies need to urgently take action to stop the current shift towards a new world order where conflicts are a lot more likely because Western democracies are no longer a hegemonic power able to crush authoritarian powers that grab land, etc. This shift is currently driven primarily by the fact that Russia & China are heavily rearming whereas Western democracies are not.
@Elizabeth
Thanks for answering, that’s very useful.
My concern is that, as far as I understand, a decent number of safety researchers think that policy is the most important area, but because, as you mentioned, they aren’t policy experts and don’t really know what’s going on, they just assume that Anthropic’s policy work is way better than those actually working in policy judge it to be. I’ve heard from a surprisingly high number of people among the orgs that are doing the best AI policy work that Anthropic’s policy work is mostly anti-helpful.
Somehow though, internal employees keep deferring to their policy team and don’t update on that part/take their beliefs seriously.
If it’s true, it is probably true to an epsilon degree, and it might be wrong because of weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date. I’ve yet to hear a positive example.