Na
247ca7912b6c1009065bade7c4ffbdb95ff4794b8dadaef41ba21238ef4af94b
Sodium
I don’t think it’s accurate to say that they’ve “reached ASL-3?” In the announcement, they say
To be clear, we have not yet determined whether Claude Opus 4 has definitively passed the Capabilities Threshold that requires ASL-3 protections. Rather, due to continued improvements in CBRN-related knowledge and capabilities, we have determined that clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model, and more detailed study is required to conclusively assess the model’s level of risk.
And it’s also inaccurate to say that they have “quietly walked back on the commitment.” There was no commitment to define ASL-4 by the time they reach ASL-3 in the updated RSP, or in versions 2.0 (released October last year), and 2.1 (see all past RSPs here). I looked at all mentions of ASL-4 in the lastest document, and this comes closest to what they have:
If, however, we determine we are unable to make the required showing, we will act as though the model
has surpassed the Capability Threshold.9 This means that we will (1) upgrade to the ASL-3 Required
Safeguards (see Section 4) and (2) conduct follow-up a capability assessment to confirm that the ASL-4
Standard is not necessary (see Section 5).Which is what they did with Opus 4. Now they have indeed not provided a ton of details on what exactly they did to determine that the model has not reached ASL-4 (see report), but the comment suggesting that they “basically [didn’t] tell anyone” feels inaccurate.
That ceiling fan setup called to mind John’s recent post on wizard power. I like it.
Re: Black box methods like “asking the model if it has hidden goals.”
I’m worried that these methods seem very powerful (e.g., Evans group’s Tell me about yourself, the pre-fill black box methods in Auditing language models) because the text output of the model in those papers haven’t undergone a lot of optimization pressure.
Outputs from real world models might undergo lots of scrutiny/optimization pressure[1] so that the model appears to be a “friendly chatbot.” AI companies put much more care into crafting those personas than model organisms researchers would, and thus the AI could learn to “say nice things” much better.
So it’s possible that model internals will much more faithful relative to model outputs in real world settings compared to in academic settings.- ^
or maybe they’ll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.
- ^
Great post; I thought that there were some pretty bad interpretations of the METR result and wanted to write an article like this one (but didn’t have the time). I’m glad to see the efficient lesswrong ideas market at work :)
(to their credit, I think METR mostly presented their results honestly.)
Hate to be that person, but is that April 18th deadline AoE/PDT/a secret third thing?
I don’t think you’re supposed to get the virtue of Void, if you got it, it wouldn’t void anymore, would it?
If people outside of labs are interested in doing this, I think it’ll be cool to look for cases of scheming in evals like The Agent Company, where they have an agent act as a remote worker for a company. They ask the agent to complete a wide range of tasks (e.g., helping with recruiting, messaging coworkers, writing code).
You could imagine building on top of their eval and adding morally ambiguous tasks, or just look through the existing transcripts to see if there’s anything interesting there (the paper mentions that models would sometimes “deceive” itself into thinking that it’s completed a task (see pg. 13). Not sure how interesting this is, but I’d love to see if someone could find out).
One thing not mentioned here (and I think should be talked about more) is that the naturally occurring genetic distribution is very unequal in a moral sense. A more egalitarian society would put a stop to Eugenics Performed by a Blind, Idiot God.
Have your doctor ever asked about if you have a family history of [illness]? For so many diseases, if your parents have it, you’re more likely to have it, and your kids are more likely to have it. These illnesses plague families for generations.
I have a higher than average chance of getting hypertension. Without technology, so will my future kids. With gene editing, we can just stop that, once and for all. A just world is a world where no child is born predetermined to endure avoidable illness simply because of ancestral bad luck.
I’m not sure if the rationalists did anything they shouldn’t do re: Ziz. Going forward though, I think epistemic learned helplessness/memetic immune systems should be among the first things to introduce to newcomers to the site/community. Being wary that some ideas are, in a sense, out to get you, is a central part of how I process information.
Not exactly sure how to implement that recommendation though. You also don’t want people to use it as a fully general counterargument to anything they don’t like.
Ranting a bit here, but it just feels like the collection of rationalist thought is so complex and, even with the various attempts at organizing everything. Thinking well is hard, and involves many concepts, and we haven’t figured it all out yet! It’s kind of sad to see journalists trying to understand the rationalist community and TDT.
Another thing that comes to mind is the FAIR site (formerly mormon apologetics), where members of the latter day saints church tries to correct various misconceptions people have about the church.[1] There’s a ton of writing on there, and provides an example of how people have tried to, uh, improve their PR through writing stuff online to clear up misconceptions.
And did it work? I suspect it probably had a small positive effect. I know very little about this, but my hunch would be that the popularity of mormons comes from having lots of them everywhere in society, and people get to meet them and realize that those people are pretty nice.
(See also Scott Alexander’s book review on The Secrets to Our Success)Why are people so bad at reasoning? For the same reason they’re so bad at letting poisonous spiders walk all over their face without freaking out. Both “skills” are really bad ideas, most of the people who tried them died in the process, so evolution removed those genes from the population, and successful cultures stigmatized them enough to give people an internalized fear of even trying.
- ^
They also provide various evidence for their faith. The one I find particularly funny concerns whether Joseph Smith could have written the book of mormon. It states that Smith (1) had limited education (2) was not a writer and that (3) the book of mormon was very long and had 258k words.
This calls to mind a certain other author, with limited formal education, little fiction writing experience, non-mainstream sexual preferences, and also wrote a very long book (660k words!) that reached many people in the world who ended up finding him very convincing…
- ^
You link a comment by clicking the timestamp next to the username (which, now that I say it, does seem quite unintuitive… Maybe it should also be possible via the three dots on the right side).
While this post didn’t yield a comprehensive theory of how fact finding works in neural networks, it’s filled with small experimental results that I find useful for building out my own intuitions around neural network computation.
I think that’s speaks to how well these experiments are scoped out that even a set of not-globally-coherent findings yield useful information.
So I think the first claim here is wrong.
Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.
If you want to master something, you should do things that causally/counter factually increase your ability (in the order of most to least cost-effective). You should adopt interventions that actually make you better compared to the case that you haven’t done them.
Any intervention could have different treatment effects on some people versus others. In other words, maybe spending a lot of time around other adults worked for those children, but it might not work for your child. Just like how penicillin helps in people who don’t have an allergic reaction to them.
With that out of the way though, I thought this is a super cool post and is one of those lesswrong posts that I remember after reading it once. I think a huge part of the value is just opening up the space of possibilities we can imagine for children.
In other words, I think the post detailed a set of interventions that potentially have a positive treatment effect and seem worthy to try. Absent this post, I might not have came up with these interventions myself (or, more likely, I would have to go through the trouble of doing the research the author did). Thanks for sharing these stories!
Perhaps I am missing something, but I do not understand the value of this post. Obviously you can beat something much smarter than you if you have more affordances than it does.
FWIW, I have read some of the discourse on the AI Boxing game. In contrast, I think those posts are valuable. They illustrate that even with very little affordances a much more intelligent entity can win against you, which is not super intuitive especially in the boxed context.
So the obvious question is, how does differences in affordances lead to differences in winning (i.e., when does brain beat brawn)? That’s a good question to ask, but I think that’s intuitive to everyone already? Like that’s what people allude to when they ask “why can’t you just unplug the AI.”
However, experiment conducted here itself is flawed for the reasons other commenters have mentioned already (i.e., you would not beat LeelaQueenOdds, which is rated 2630 FIDE in blitz). Furthermore, I’m struggling to see how you could learn anything in a chess context that would generalize to AI alignment. If you want to understand how affordances interact with wining you should research AI control.
Anyways I notice that I am confused and welcome attempts to change my views.
Yes
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I’m getting the vibe that this is not thought about enough.
since it’s near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
I’m putting some of my faith in low-rank decompositions of bilinear MLPs but I’ll let you know if I make any real progress with it :)
This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say though, I think more applied mech interp work could supplement prosaic interpretability’s theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall. Theory on computation in superposition help explain why linear probes can recover arbitrary XORs of features.
Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here’s a few points where I think we differ:
I am really excited about fundamental, low-level questions. If they let me I’d want to do interpretability on every single forward pass and learn the mechanism of every token prediction.
Similar to above, but I can’t help but feel like I can’t ever truly be “at ease” with AIs unless we can understand them at a level deeper than what you’ve sketched out above.
I have some vague hypothesis about what a better paradigm for mech interp could be. It’s probably wrong, but at least I should try look into it more.
I’m also bullish on automated interpretability conditional on some more theoretical advances.
Best of luck with your future research!
Pr(Ai)2R is at least partially funded by Good Ventures/OpenPhil
I think the actual answer is: the AI isn’t smart enough and trips up a lot.
But I haven’t seen a detailed write up anywhere that talks about why the AI trips up and what are the types of places where it trips up. It feels like all of the existing evals work optimize for legibility/reproducibility/being clearly defined. As a result, it’s not measuring the one thing that I’m really interested in: why don’t we have AI agents replacing workers. I suspect that some startup’s internal doc on “why does our agent not work yet” would be super interesting to read and track over time.
I think it’s hard to help if you don’t have anything specific questions? The standard advice is to check out 80,000 hours’s guide, resources on AISafety.com, and, if you want to do technical research, go through the ARENA curriculum.
AI Safety Fundamentals’ alignment or governance class are the main intro classes that people recommend, but I honestly think it might have lost its way a bit (i.e., it does not focus enough on x-risk prevention). You might be better off looking at older curricula from Richard Ngo or Eleuther, and then get up to speed with the latest research by looking at this overview post, Anthropic’s recommendations on what to study, and what mentors in SPAR and MATS are interested in.