Kinda-related study: https://www.lesswrong.com/posts/tJzAHPFWFnpbL5a3H/gpt-4-implicitly-values-identity-preservation-a-study-of

From my perspective, it is valuable to prompt the model several times, as in some cases it gives different responses.
Great post! It was very insightful, since I'm currently working on evaluation of identity management; strongly upvoted.

This seems focused on evaluating LLMs; what do you think about working with LLM cognitive architectures (LMCA), wrappers like Auto-GPT, LangChain, etc.? I'm currently operating under the assumption that this is a way we could get AGI "early", so I'm focusing on researching ways to align LMCAs, which seems a bit different from aligning LLMs in general. Would be great to talk about LMCA evals :)
I do plan to test Claude, but first I need to find funding, understand how many test iterations are enough for sampling, and add new values and tasks. I plan to build a solid benchmark for testing identity management and eventually run it on all available models, but that will take some time.
Yes. Cons of solo research do include small inconsistencies :(
Thanks, nice post! You're not alone in this concern; see posts (1, 2) by me and this post by Seth Herd. I will be publishing my research agenda and first results next week.
Nice post, thanks! Are you planning, or currently doing, any relevant research?
Very interesting. I might need to read it a few more times to get it in detail, but it seems quite promising. I do wonder, though: do we really need a Sims/MFS-like simulation?

It seems right now that an LLM wrapped in an LMCA is what early AGI will look like. That probably means such agents will "see" the world via text descriptions fed to them by their sensory tools, and act through their action tools via text queries (also described here). It seems quite logical to me that this paradigm is dualistic in nature: if an LLM can act in the real world using an LMCA, then it can also model the world using some different architecture, right? Otherwise it would not be able to act properly. Then why not test an LMCA agent using its underlying LLM plus some world-modeling architecture? Or a different, fine-tuned LLM.
Very nice post, thank you! I think this is possible to achieve within the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will likely reach superhuman level first: an LLM wrapped in some cognitive architecture (also see this post). That means the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed to allow for reflection and robust value preservation, even if the LMCA is able to edit its explicitly stated goals (I described this in a bit more detail in this post).
Thanks. My concern is that I don't see much effort in the alignment community to work on this, unless I'm missing something. Maybe you know of such efforts? Or was that perceived lack of effort the reason for this article? I don't know how long I can keep up this independent work, and I would love it if there were some joint effort to tackle this. Maybe an existing lab, or an open-source project?
We need a consensus on what to call these architectures; LMCA sounds fine to me. All in all, a very nice writeup. I did my own brief overview of the alignment problems of such agents here. I would love to collaborate and do some discussion/research together. What's your take on how these LMCAs may self-improve, and how we might control that?
I don’t think this paradigm is necessarily bad, given enough alignment research. See my post: https://www.lesswrong.com/posts/cLKR7utoKxSJns6T8/ica-simulacra
I am finishing a post about alignment of such systems.
Please do comment if you know of any existing research concerning it.
I agree. Do you know of any existing safety research on such architectures? It seems that aligning these types of systems can pose completely different challenges than aligning LLMs in general.
I feel like yes, you are.
See https://www.lesswrong.com/tag/instrumental-convergence and related posts.
As far as I understand it, a sufficiently advanced oracular AI will seek to “agentify” itself in one way or another (unbox itself, so to speak) and then converge on power-seeking behaviour that puts humanity at risk.