Always looking for feedback and happy to be corrected.
PhD in Geometric Group Theory >> Postdoc in Machine Learning >> AI safety Research.
White-box techniques to detect deception at LASR Labs
Mentoring AI safety projects at BlueDot
Shivam
Yes and no. Yes in principle, but no in practice. The point I am making is that, with respect to all possible measures, seemingly aligned and aligned models are indistinguishable. I agree that many people are aware of this and that it has been discussed for a long time on this forum, but I am surprised that evidence of apparent alignment is still considered progress towards the latter by a lot of people, whereas I see measurable evidence of the latter as almost impossible to obtain.
(Seemingly) Aligned AI is the most dangerous AI
Adding this comment based on a discussion on this post outside of LessWrong.
The misalignment vs. failure to align distinction may not be the most useful framing for all readers. A more direct question to consider is:
“Can the misaligned model stay deployed and acquire power?”
This framing clarifies the core argument: many x-risk evaluation papers do not adequately address this question. In fact, capability evaluations can often be more informative for this purpose. In contrast, repetitive papers stress-testing alignment tend to produce insights equivalent to discovering new jailbreaks, which offer limited value when current models' alignment is already known to be brittle.
The Projection Problem: Two Pitfalls in AI Safety Research
Quite insightful. I am a bit confused by this:
“Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer.”
Can you explain that more?
I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when asked for its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
I completely agree, and this is what I was thinking here in this shortform: that we should have an equivalence between capability and risk benchmarks, since all AI companies try to beat the capability benchmarks and fail the safety/risk benchmarks due to obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark x, then it is good enough to construct or help with CBRN attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.
Shivam’s Shortform
An important line of work in AI safety should be to prove the equivalence of various capability benchmarks to risk benchmarks, so that when AI labs show their model crossing a capability benchmark, it is automatically crossing an AI safety level.
“So we don’t have two separate reports from them: one saying that the model is a PhD-level scientist, and the other saying that studies show that the CBRN risk with the model is no greater than that of an internet search.”
Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
What really bothers me is the trajectory we seem to have chosen: continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
1. There is no verifiable way to know whether we got an aligned or a seemingly aligned AI as a result, since evaluations can’t distinguish between the two.
2. White-box techniques seem to be pretty limited currently, and it is uncertain whether we will get distinguishable signals if the most reliable techniques are only developed past a certain (unknown) capability mark.
3. If we continue moving towards more automated pipelines, because we feel it is safe, we won’t be able to limit catastrophes.
I don’t see much pushback from a technical perspective against this trajectory. For people starting out in AI safety from a technical perspective, I don’t see many suggestions that challenge this trajectory and propose alternatives. There is Scientist AI from Yoshua Bengio, but it doesn’t seem to be discussed as much. I see theoretical work with Simplex and SLT, but it still seems to be at an early stage.
And since most beginner work is about replicating and extending current work, this creates a chain reaction, and as a result the majority of the work ends up following the same trajectory.