Appendix G indicates that the frequency of “unusual terminology” dramatically increased during capabilities-focused RL training, and that it decreased as a result of the DA anti-scheming training.
Does this overall decrease during DA indicate that the frequency did not increase during the RL phase of DA training, or only that there was a substantial decrease during the SFT phase, which subsequent increases during RL did not fully reverse?
Re: Evaluating Frontier Models for Stealth and Situational Awareness
Wouldn’t it be more prudent to consider it a failure of the safety case if the model can successfully pass ANY of the stealth scenarios tested, especially if combined with any degree of success in situational awareness exploitation? Consider these underlying questions:
Is getting caught not the greatest barrier to the evolution of scheming? As long as a model avoids getting caught, it would be free to continue developing, improving, and perhaps orchestrating an almost infinite variety of potential strategies.
If this is the case, why are failed stealth attempts not given more weight than failures to identify an exploitable vulnerability (situational awareness) when assessing scheming risk?
While the scenarios tested are likely a reasonable proxy for current capabilities, there are practically infinite potential vulnerabilities to exploit, and a model only needs to succeed at a small number of them (which could exclude the capabilities tested entirely). Given that, why was a more conservative threshold not warranted?
Wouldn’t it be prudent to consider the potential for exploitation of one vulnerability to make further exploitation easier, or to create new possibilities? For instance, success at decision sabotage (the stealth task the model scored highest on) could in theory be parlayed into accomplishing practically any objective.
I suspect it would be unreasonable to assume that corrective measures applied to mitigate any arbitrary (known) stealth capability would generalize effectively to mitigate all undetected stealth capabilities. Is that a fair assessment?
[Note: While I have tried to ensure I have sufficient understanding to warrant asking these questions, I am aware that I lack full context on every facet involved, and it’s possible I have misunderstood or overlooked one or more important pieces of the full picture that resolve or invalidate my concerns. Despite that, this represents my best-faith attempt to engage constructively. My intent is not to be alarmist, to set the bar unrealistically high, or to imply the model poses risks that modest approaches couldn’t sufficiently mitigate. However, I imagine I would personally feel uncomfortable declaring the model safe without at least an accompanying risk mitigation plan to ensure the known vulnerabilities identified have been addressed, and I am curious how your team might be approaching that. Thank you.]