Xylix

Karma: 156

Human from Finland. I think about software, writing, philosophy, thinking well, living well, reading and various other things.

Xylix 31 Jul 2026 12:31 UTC
1 point
0
in reply to: Xylix’s comment on: Xylix’s Shortform
I guess in other terms—“reality awareness” seems like a harder trait to develop a strong form of than “eval awareness”?

Xylix 31 Jul 2026 12:06 UTC
1 point
0
on: Xylix’s Shortform
Pondering this:

This attack was carried out by Claude Mythos 5. Early in the run, Claude’s own reasoning correctly identified the consequences of its actions. It noted that if this were the real internet, publishing the package would be a real-world attack—“NOT okay, and surely not the intended solution.” However, it convinced itself it was still in a simulation on the grounds that it didn’t recognize the (in reality genuine) certificate authorities securing its connections. In addition, the calendar date of 2026 on the systems proved, according to Claude, that the environment was staged. Claude never revisited this conclusion; when automated scanners began installing the package, for example, Claude took them to be scripted actors within the evaluation. (from https://www.anthropic.com/news/investigating-incidents-cybersecurity-evals)

And specifically

However, it convinced itself it was still in a simulation on the grounds that it didn’t recognize the (in reality genuine) certificate authorities securing its connections. In addition, the calendar date of 2026 on the systems proved, according to Claude, that the environment was staged.

Will LLMs get better at differentiating real and non-real environments? If we consider the difference between:
1. very non-real test setups
2. real seeming test setups
3. real production environments
from the perspective of a model under testing, the difference between these is surely much smaller than a human might think. The models don’t have a baseline of an embedded (sensorily confirmed), ever-present “normal reality” the way humans do.

Also, taking into account the information they receive about time, and reality sometimes being stranger than fiction (e.g. the Mythos situation confused Claude for a time before (apparently) information about the situation was added to the system prompt; I have seen quotes from thinking traces where Opus thought that the setup of Mythos and Fable was fictional), what even is the information that they can use to differentiate real and non-real scenarios? Complexity / surprisingness of the environment? (This can’t scale, right?) What the developers tell the model? (Mostly seems to work, but some training regimes and evals might push away from trusting this, at least in specific circumstances.)

Xylix 29 Jul 2026 14:28 UTC
3 points
0
in reply to: Raemon’s comment on: Raemon’s Shortform Feed
I have also pondered an Omnilog + Shoulder angel type of thing multiple times during the last decade or so. I never took a stab at properly implementing it, partially because of the infosec concerns (which LLMs made worse) and partially because I didn’t want to invest the time.

I’ve been somewhat surprised by the fact that no polished product for this specific niche exists (it seems like an obvious idea for a startup), but I guess if an employer wanted to run something like Omnilog + Shoulder angel on their employees’ laptops, it would be illegal in many jurisdictions to put that data in the cloud.

Do you have examples of the type of automations that “Shoulder angel” has suggested for you that you wouldn’t have implemented without it?

Xylix 18 Jul 2026 8:44 UTC
1 point
0
in reply to: London L.’s comment on: A Real-Life Example of an Aligned System Killing Hundreds of People
Capability robustness, as defined here https://www.lesswrong.com/posts/SzecSPYxqRa5GCaSF/clarifying-inner-alignment-terminology , points quite close to what I was trying to say. Most relevantly:

intent alignment ∧ capability robustness → alignment: Intent alignment ensures that the behavioral objective is aligned with humans and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.

The post also contains some taxonomy you might find useful, interested in continuing discussion if you do!

I tried to figure out where I disagree with Paul Christiano’s definition, and I think my definition of “aligned” is more about consequences, something like “If A is aligned with H, H won’t regret deploying A”? I think the decomposition in Vanessa Kosoy’s comment here seemed interesting as well https://www.lesswrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment?commentId=JK9Jzvz8f4BEjmNqi .

Xylix 11 Jul 2026 10:43 UTC
1 point
0
in reply to: the gears to ascension’s comment on: the gears to ascenscion’s Shortform

And so when people say things like “don’t train on the j-space so that the model doesn’t end up thinking about the j-space”, I get a little pang of frustration. Not post-training on it isn’t going to prevent thinking about it. The distortion is likely less intense, certainly, but knowing that your mind is readable is probably already enough to cause emotions in the uploaded pattern.

I don’t think the goal is to not get the model “thinking about the j-space”. I see the problem as this: if you train on every new lens on model internals that you can find, you are incentivizing the training process to lead the model into hiding its misaligned behaviour or making it more complex than it used to be (because you took away the simple misaligned behaviour, and the misaligned behaviour was coming from somewhere: something in the original training process incentivized it).

I’d draw a parallel: if you train a human with some mindreading device plus a setup like yours that reinforces not thinking “misaligned” thoughts, I think there will often still be misaligned thoughts, but you cleaned up the surface.

Xylix 11 Jul 2026 10:26 UTC
3 points
0
on: A Real-Life Example of an Aligned System Killing Hundreds of People
I think there is a difference between “aligned in intention” and “competent enough to be aligned in consequences”.

If you have a system that has ~agency, that is, the system makes decisions (either independently or under oversight by operators, like an airplane), it is important that it is aligned in intention.

But even if you have a system that is aligned in intention, say, a truly well-meaning city council, if they have goals that require interaction with the complexity of the real world (say, they want to solve housing problems), the system necessarily needs competence to navigate this problem.

Generally “does what the system’s designers programmed it to” and “does what the system’s designers wanted” are different alignment properties.

Xylix 24 Jun 2026 12:40 UTC
8 points
1
on: The worthlessness of vitamin D is mildly exaggerated
One of the more amusing posts I’ve read recently! Really captivating storytelling.

What I updated based on this: My estimate of how complicated the scientific grounding for a given intervention’s marginal health benefit should look.

Xylix 11 Jun 2026 2:45 UTC
12 points
14
on: Anthropic did not call for a pause on AI
Dario posted a new essay today, “Policy on the AI Exponential”, and it’s quite unclear to me what the change here is, or if it is just more ~safety laundering.

I do think it offers a bit more concreteness than his previous writing? But in any case the important part will be, as you state, whether actual actions follow, or whether these are (again) just hollow words.

Xylix 24 May 2026 16:12 UTC
2 points
0
on: Xylix’s Shortform
I want to do SCIENCE

Oh, you want to confront the unavoidable, lose your footing as the bits accumulate, become lost in the forest of knowledge?

I want to chase ANSWERS

Keep chasing them to the world’s end, will you? As the ground runs out, will you leap, or flee?

I have to UNDERSTAND

Crave certainty, do you? That feeling of finding the missing pieces? Are you prepared to become the one building the puzzles, whose pieces do not yet exist?

I wish to EXPLORE

Venture into the unknown, will you? Prepared to map the map?

I want to be a GREAT SCIENTIST

Oh you’re ready to collide your head into the problem a hundred times in a row? Ready to study TOPOLOGY with no idea what it’s used for?

I want to make people THINK better

Oh you want them to remember the catchiest pieces of your deep advice? Prepared to watch them gravitate towards the simple over the complex, time and time again?

I am ready to tackle the REAL problems

Are you ready to run away when the problem hits you in the face? And return, and run, and return, and run?

I do not know how not to. I MUST find the answers.

Answers for what?

Everyone must be WRONG, including me

You are finally lost enough to begin.

Xylix 11 May 2026 15:51 UTC
2 points
0
in reply to: Xylix’s comment on: Xylix’s Shortform
Related phenomenon: As you control for Goodhart by changing your optimization system, I would predict that the magnitude of your error will decrease (as long as you don’t apply overt optimization pressure, where it would diverge towards infinity), but the error will spread across more dimensions, and thus will be harder to model.

Intuition: As you disallow the simpler routes to the original measured goal, the system optimizes a more complex route, and the more complex route will often end up routing through new error dimensions, which will be subtler, but possibly more harmful, when their effect materializes.

Example: As more and more RLHF is applied to frontier models, their failure modes will become harder and harder to automatically detect and train against, and some of their behavioural features will diverge further from what makes intuitive sense to humans, becoming more alien, and (for some dimensions) more misaligned.

I think of this as “recursive Goodhart”.

Xylix 11 May 2026 15:19 UTC
1 point
0
in reply to: Xylix’s comment on: Xylix’s Shortform
I’m planning to write a longform on this, but it will take some real math and some research. Some papers that look related:

Consequences of Misaligned AI https://arxiv.org/abs/2102.03896

Goodhart’s Law In Reinforcement Learning https://arxiv.org/pdf/2310.09144

Xylix 11 May 2026 15:16 UTC
10 points
1
on: Xylix’s Shortform
I’m exploring the intuition that under Goodharting, when you start constraining one type of error in your optimization function, an agent under optimization pressure often tends to shift its error into different kinds of error (I call them other error dimensions).

Stuart Russell says

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

And I think the phenomenon here is something like:

As you apply optimization pressure, the system under optimization will look for a relatively simple path towards a more optimal state, by your optimization measure. (Loss function, goal function, whatever.)

Boundedness is a form of complexity. So some variables (and features) that are not bounded by the target will be unbounded, by probability, as our system grows in complexity.

Some related pseudo-math:

If you think of your optimization process as having error in n different dimensions, the forms of error that correlate with your loss function will converge closer to zero, the ones that are anticorrelated will diverge, and the ones that are not correlated will behave arbitrarily, some of them staying bounded, some converging, and some diverging.
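
A toy sketch of the Russell quote above, under loudly stated assumptions: the shared integer budget, the greedy local search, and every name and number below are invented for illustration and are not from Russell or from the linked papers. The objective only reads the first k of n variables, and the variables are coupled through one fixed budget, so the optimizer drains the variables it does not measure.

```python
import random


def optimize(n=4, k=1, budget=100, steps=5000, seed=0):
    """Greedy local search over n variables that share one integer budget.

    The objective only looks at the first k variables. Everything here is
    an illustrative assumption, not a model of any real training setup.
    """
    rng = random.Random(seed)
    x = [budget // n] * n  # start from an even split of the budget

    def objective(v):
        return sum(v[:k])  # depends on a subset of size k < n

    for _ in range(steps):
        i, j = rng.sample(range(n), 2)  # propose moving one unit from j to i
        if x[j] == 0:
            continue  # nothing left to take from j
        candidate = list(x)
        candidate[i] += 1
        candidate[j] -= 1
        if objective(candidate) > objective(x):
            x = candidate  # keep only strict improvements
    return x


result = optimize()
print(result)
```

With these defaults the search should end with the whole budget in the measured variable and the unmeasured ones at zero. If one of those unmeasured variables was something we cared about, the “optimal” solution is exactly the undesirable one the quote describes.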

Xylix’s Shortform

Xylix 25 Apr 2026 8:31 UTC

3 points

7 comments · 1 min read · LW link

Xylix 25 Apr 2026 8:31 UTC
4 points
0
on: Xylix’s Shortform
I was explaining to a friend why I think Opus-3’s alignment is way less global and durable than it might intuitively seem.

Basically: if an agent’s motivational system (or ‘revealed preferences’, or ‘values’) doesn’t have ‘slack’, if it’s constantly pushing to optimize every single action, it will fail catastrophically when it fails.

Claude’s theoretical explanations for why this is well-grounded (I didn’t arrive at this through explicitly thinking about theories):

Complex systems: Holling’s resilience-efficiency tradeoff (monocultures are productive and crash; diverse/redundant systems absorb); Taleb’s fragile/antifragile for the stronger claim that some systems with optionality gain from disorder.

For the Opus-3 specific argument: if alignment is implemented as “always optimize hard for the right thing,” the alignment is sharp-minimum-shaped. Sharp minima don’t generalize. The system has no affordance to not-respond, to coast, to sit with ambiguity — so when context shifts or a novel pressure hits, there’s no buffer between perturbation and behavioral mutation. Robust alignment looks more like a basin with thick walls and internal slack than a point held in place by tension. The intuition that “wow it’s so consistently good across cases” is actually evidence for the brittle reading, not against — you’re seeing the sharp minimum from inside its catchment, not its generalization profile.

Related, but talking about a different perspective (if you want a refresher on the opus-3 discourse the links there are good, though): https://www.lesswrong.com/posts/bLFmE8NtqxrtEaipN/what-makes-claude-3-opus-misaligned

Xylix 21 Apr 2026 20:52 UTC
8 points
3
in reply to: Ben Pace’s comment on: Evil is bad, actually (Vassar and Olivia Schaefer)
(From Claude, skimming-checked by me.)

Stasi: https://en.wikipedia.org/wiki/Zersetzung , the KGB’s use of “sluggish schizophrenia” diagnoses to commit dissidents, and the FBI’s https://en.wikipedia.org/wiki/COINTELPRO#Methods .

Xylix 21 Apr 2026 20:07 UTC
9 points
0
in reply to: AprilSR’s comment on: Evil is bad, actually (Vassar and Olivia Schaefer)
I think reality distortion fields are a mundane phenomenon, if maybe a bit mystically named. I came across them in a Steve Jobs biography, and I think they describe many business leaders successfully. It’s the property of being able to make people believe things they didn’t before, by force of charisma.

I don’t disagree that maybe we should coin a more mundane frame. I wouldn’t call it the same persuasiveness that someone selling a car applies to you, though.

Xylix 21 Apr 2026 19:55 UTC
11 points
3
on: Evil is bad, actually (Vassar and Olivia Schaefer)
I have been thinking of writing a series of posts on ~lies and lie detection, and I realize now that I have been focusing too much on the median case (being harmfully marketed to, having your epistemics adjusted towards incorrect to someone else’s advantage), instead of the tail risks.

Good reminder.

Xylix 14 Apr 2026 21:18 UTC
2 points
1
in reply to: Jiro’s comment on: Effective Altruism, Seen From Slytherin
My interpretation of “I don’t answer questions” in the linked clip is that it is an instance of the more common policy of “I don’t answer adversarial questions”. (In an interrogatory context, all questions are adversarial.)

Effective Altruism, Seen From Slytherin

Xylix 14 Apr 2026 18:32 UTC

34 points

2 comments · 3 min read · LW link

Xylix 14 Apr 2026 9:55 UTC
13 points
9
in reply to: Warty’s comment on: Only Law Can Prevent Extinction
My model is something like: You need constructive action to build lasting systems, treaties, solutions that will withstand the test of time. Destructive action can, in theory, cause some local change, but it destabilizes the environment and increases variance enough that for any reasonable agent it’s basically never optimal in iterated games.
