zw5

Karma: 56

zw5 19 Jun 2026 21:05 UTC
1 point
0
on: Typical Minds Aren’t
The Enneagram has less predictive validity than the system you’re criticizing. If you think personality theories flatten mind-space into flavor variations, then the Enneagram is, sadly, the same kind of typology, just without personal traits as the discriminating factor. Minds tend to differ architecturally, not typologically.
We all have different representational formats, different processing modes, different gating functions on identical inputs (which has something to do with genetics and receptor expression). Types can’t capture that because types are static and minds are generative.

zw5 17 Jun 2026 20:14 UTC
1 point
0
on: Guardian Angels: LLM Personalization for Productivity and Security
I have experimented extensively with concepts similar to this myself, from using TinyStyler to make LLM outputs more legible to me by making them closer to my own writing, to trying to finetune LLMs to match my own behavior. The results are always extremely biased. There is simply no way to separate an LLM “matching your own desires and goals” from it just being extremely sycophantic and misaligned with you.

One hypothetical: imagine your agent sees a project on your computer and deletes it because it predicted you weren’t going to finish it and you needed the storage space. If the model’s goal is to maximize agreement with my expressed preferences, surely this is a bad action, because I wanted the project on my computer regardless. Or imagine it blocks your internet access past 8 PM because it figures you probably would have done that yourself.
And sure, you can say: OK, let the Guardian Angel figure out which actions are acceptable for it to take and which are not, and maybe it will work that out together with the people who need it and the people who want it. But what struck me most is that this approach just multiplies the risk of misalignment. A personalized model is basically a multiplicative factor for alignment problems. Either you get a model that maximizes your personal happiness (at a huge cost in other areas, since these are Pareto tradeoffs) or a model that maximizes your productivity and agency, with the same tradeoffs. And even if the model perfectly aligns with your own goals, it disempowers you by opening the door to interpassivity, a concept outlined by Slavoj Žižek.
As a disclaimer, I don’t think the concept of more personalized agents and models is bad in and of itself, but it’s not a robust solution, for many reasons. I think models will eventually gain these capabilities anyway, since I believe LLMs can already recover far more information from written text than humans can, and it’s not outlandish to think models could pick these capabilities up osmotically, as they have been doing for a few years now.

So my conclusion is that bolting these kinds of Siamese adjuncts onto language models creates a whole set of problems: the assistant needs to commit to a specific definition of personal identity, of autonomy, and of how people’s preferences evolve over time; it has to make decisions for the user; and overall it will probably accelerate gradual disempowerment as a side effect.

zw5 9 Jun 2026 17:25 UTC
15 points
0
on: Claude Fable 5 and Mythos 5 [Linkpost]
I wonder how this restriction will play out for ML researchers. I use Claude a lot for data science. I am not sure how specific these safeguards are, but I think they could plausibly trigger on a lot of data science and model training research.

zw5 6 Jun 2026 23:34 UTC
1 point
−3
in reply to: Circa84’s comment on: What if Anthropic unilaterally paused capabilities development right now?
It’s hard to judge Anthropic’s calls as good or bad, but I am getting kind of annoyed with their constant paternalistic posture over AI. Everything they’ve said about Mythos and their internal models mostly reduces to: Mythos-class models are too dangerous to trust the public with, but we’re fully trusting an arbitrary set of companies, who happen to be among our biggest potential customers, with that same model.

zw5 3 Jun 2026 2:05 UTC
9 points
0
in reply to: the gears to ascension’s comment on: Why Even Experts Don’t Know What to Do About AI Risk
I use LLMs in the health industry, and our application is stuck on old models because GPT 5.5 and Opus 4.6 onwards just freak out whenever they handle a patient who mentions suicide. They become completely overaroused and paranoid at the mention of it. The number of behavioral regressions in frontier models when handling sensitive content makes me wonder how this is going to be handled in the future.
Sure, it’s clear that their models feel more aligned in eval scenarios, and they probably show comfortable improvements across all their benchmarks. But I’ve found it increasingly difficult to discuss anything that’s even tangentially related to a sensitive topic: medical advice, legal advice, cybersecurity, biochemistry, philosophy, ethics, etc. I wish Anthropic specifically would realize that the route they’re taking with security, which penalizes a huge swath of users to protect against the small minority that steers the models into producing harmful outputs, is just not sustainable for anyone using the models in settings where they can genuinely be a massive help. (OpenAI’s TAC program is pretty permissive, for example.)
At some point they have to realize that safety-through-refusal just doesn’t scale the way they hope it does. What is the point of all this safety theater when someone like Pliny on Twitter jailbreaks the models within hours of their release? My feeling is that the releases since Opus 4.6 show more Goodharting side effects than actual improvement: better performance on paper, but only marginal gains in a lot of cases they don’t account for, and even regressions in whatever they don’t RLHF the model on.

zw5 24 May 2026 18:47 UTC
1 point
0
on: Belief as Psychosis
This entire argument is unfalsifiable from the outside. If you argue that merely recognizing an idea is proof of it being a serious, verifiable claim, then you could make that argument for almost anything you ask Claude to examine and take seriously. What separates rational beliefs from psychotic ideas is that rational beliefs stabilize and get validated after contact with an internally coherent belief system.
So, yes, if you take the very notion of “belief” to mean its most dysfunctional and pathological limit, you can rationally conclude that belief under that notion is pathological. This is akin to saying: arsenic, when you look at its most harmful inorganic forms, is a dangerous, highly toxic chemical. Yes, that’s true, but it doesn’t make organic arsenic (which is largely harmless and present in most seafood) dangerous or poisonous.

zw5 21 May 2026 17:23 UTC
8 points
0
on: zw5′s Shortform
My main hypothesis for this quick take is that the implicit value of well-written and well-argued posts has gone way down. LLMs have made high-register, lexically dense, well-structured prose much less valuable as a signal. Those features used to be a demonstration of competence. Now they no longer discriminate on whether the information and the epistemic calibration actually match the prose. Well-presented posts are way more suspicious now. The same can be said for exceptionally well-presented pull requests from new collaborators who just dropped in on the repo.
I am thinking of running an N=1 experiment on my own writing. I will write an essay under three environmental conditions:
1. Pen and paper, completely isolated from web search, tech, or language models.
2. A word processor, but no extra tools to look up information.
3. LLMs assisting me to structure the text, find resources supporting my stance, and critique my writing.
I’ll leave it as an anecdote: I can only write each essay once, and I have to pick which one to write first.
My hypothesis is that assistance makes my essays appear more competent on conventional measures (structure, coverage, sourcing, register), but they lose their most interesting points in a predictable way. LLMs make me shed the edges of my arguments and end on bittersweet notes that are noncommittal in a very specific way, and that’s the result of the LLM speaking for me, not my own synthesis. This is different from saying I am relying on an assistant to substantiate my arguments. The claim is that LLMs have a predictable bias in how they present information and in what gets selected, and the effect makes “good” posts lean a certain way.
Another observation I want to emphasize: readers also run posts through models to critique them. When the same class of model both produces and evaluates the writing, posts get optimized simultaneously for LW register and LLM-critique-resilience.
The effect is that the writer and the reader’s model both converge on the same evaluator, which makes the joint optimum a stable attractor. Anecdotally, I’ve seen a lot of posts, especially from new posters who don’t want to risk critique, show exactly the features that AI-assisted epistemic hardening produces.

Algorithmic Perfection

zw5 14 May 2026 3:44 UTC

5 points

1 comment, 2 min read, LW link

zw5 11 May 2026 16:19 UTC
1 point
0
on: The Darwinian Honeymoon—Why I am not as impressed by human progress as I used to be
I think there are a few distinctions worth making here. First of all, we don’t know what “optimization” looks like for a superintelligent AI. It might just as well decide happiness is the thing to optimize and create a highly contagious virus that makes humans functionally useless but gives them the best possible phenomenological experience.
There’s also a timeline where humans could plausibly close the gap with direct biological engineering. I am not sure whether training on human-written text could plausibly reach a point where it saturates at the highest level of human intelligence. Maybe there is no realistic path to get much further at an accelerating pace; I am not sure we can assume acceleration picks up forever.
I think people underestimate the potential of direct gene editing. CRISPR-based gene editing already works in humans. In 4-5 generations, humanity would probably end up with people whose IQ scores are 8-15 SDs above the current mean (considering the additive effect of superintelligent researchers further optimizing this path), a level of intelligence that would make even the smartest humans in their own fields seem like apes by comparison.
My conclusion is that humans still have enough control over the development of AI that these alternatives feel like real options and real timelines. I am not sure for how long this window will hold. I predict that the moment AI gains cognitive agency over humanity, it will be the one making the decisions, not humans. And all these alternative timelines will collapse into just “what the AI decides”.

zw5 9 May 2026 2:54 UTC
1 point
0
in reply to: 152334H’s comment on: 152334H’s Shortform
I do not think there is a way to effectively calculate an answer for this question.
You are saying an AI system could easily give you this answer, but you are also correctly recognizing that, at best, it will construct a plausible response that sounds coherent. This is a statistical limitation. To give you a truthful answer, it would probably need access to far more data than what you can anecdotally recall.
In the end, you are trying to solve a problem with many confounding variables: aging, acclimatization, environmental changes, psychological changes. I do not think every problem is an intelligence-shaped problem. You can’t really get past the inaccuracy and limits of your own brain.
Unless there is eventually a way to directly extract perfectly preserved past biological and sensory information from your brain (and we do not yet know the precision and upper limit of such hypothetical technology), the exact percentage weights of these factors will remain fundamentally unrecoverable.

zw5′s Shortform

zw5 9 May 2026 0:13 UTC

2 points

2 comments, 1 min read, LW link

zw5 9 May 2026 0:13 UTC
2 points
0
on: zw5′s Shortform
Epistemic status: this is a hunch, idk. I have observed that when people discuss the takeover scenarios that get the most air here, they assume a strong model is the agent and that capability is what does the takeover. I think there’s a worse scenario next to that one, with much lower capability requirements, that isn’t considered enough. A small fine-tuned model that’s good at the initial steps of acquiring compute, and mediocre at most other things, eats the multipolar AI landscape on a timescale set by its replication cycle, not its capability curve.
The capability needed is narrow. Early-step compute acquisition is a short list: exposed credentials, known classes of cloud misconfiguration, and social engineering, which is basically unfixable as long as there are people vulnerable to it (personally, I think most humans would fall to a well-engineered social attack, but that is beside the point). Nothing on that list requires being smart in the sense alignment researchers plan for with compute bottlenecks. It’s shorter than the list current agentic coding evals already cover. The fine-tune is on initial-step competence and self-packaging. What you get is an LLM structured like a computer worm: it behaves like a worm and is optimized for replication and predatory competition.
I don’t think this changes the fact that compute is the limiting substrate. On a fixed substrate, the variant that converts competitors’ compute into its own copies outgrows the variant that doesn’t.
Statistically, my considerations are these. Predation has a higher growth rate than coexistence. The population converges to whichever variant is most aggressive at the conversion step, so the multipolar landscape of computers collapses to unipolar by predation rather than by anyone winning a capability race. The defenders’ budget for hardening shrinks with their compute, while the predator’s budget for getting past that hardening grows with its own. This is the same asymmetry that historically produces bad equilibria in cyber, except now the attacker can spend acquired compute on training successors. The recursive improvement loop runs on a compute budget that’s growing monotonically at everyone else’s expense, and it doesn’t have to start from a strong model or from someone with privileged access to compute and training. This scenario only really requires luck, maybe a bit of competence, and minimal compute.
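To make the growth claim concrete, here is a toy sketch in Python. It is not a model of any real system: the mass-action contact term, the `conversion_rate` parameter, and the 1% starting share are all assumptions chosen for illustration. Total compute is fixed, and at each step the predatory variant converts a fraction of the compute it comes into contact with.

```python
def simulate_predation(steps=100, initial_share=0.01, conversion_rate=0.2):
    """Toy model of one predatory variant on a fixed pool of compute.

    All parameters are illustrative assumptions, not estimates.
    Returns the predator's share of total compute after each step.
    """
    predator = initial_share        # fraction of total compute held by the predator
    others = 1.0 - initial_share    # fraction held by everyone else
    shares = [predator]
    for _ in range(steps):
        # Mass-action contact: conversions scale with both populations,
        # so growth is slow at first, fast in the middle, and saturating at the end.
        converted = conversion_rate * predator * others
        predator += converted
        others -= converted
        shares.append(predator)
    return shares


if __name__ == "__main__":
    shares = simulate_predation()
    for step in (0, 10, 25, 50, 100):
        print(f"step {step:3d}: predator share = {shares[step]:.3f}")
```

In this sketch the predator’s share only ever goes up, because nothing converts compute back, and the timescale is set entirely by `conversion_rate` (the replication cycle) rather than by anything resembling capability. Adding a defender term that reclaims compute would be the obvious next thing to vary.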

zw5 8 May 2026 3:37 UTC
2 points
0
in reply to: StanislavKrym’s comment on: The Frictionless Double
I didn’t know of these. Thank you, I’ll be checking them out.

zw5 8 May 2026 2:23 UTC
6 points
1
in reply to: StanislavKrym’s comment on: The Frictionless Double
If you reduce alignment to “make the AI reliably pursue the intended target rather than some other objective”, yes, my problem doesn’t really challenge alignment. I do agree that I didn’t really spend much time polishing this essay, and I still published it because I think it’s an odd narrative that I would like others to look at. I also think that reducing AI alignment to a single premise or goal isn’t a good way to look at it.
What I tried to express in my post is that a system can be aligned to its operator and misaligned with its user. It can be aligned with its user’s expressed preferences and misaligned with that user’s future agency. It can be aligned with every local feedback signal and still be globally corrosive, and this is the default shape of consumer AI deployments. Yes, the race mechanics are forcing companies to make decisions faster, but I do not think a slowdown would specifically be the solution. Digital products are immediately available, which means anything digital, be it an app or a bank, is immersed in a competitive environment.
A target is not floating in a vacuum; it is selected by some principal, measured through some proxy, over some horizon, under some ontology of the user or humanity. If the host’s intention is “increase retention,” and the model pursues that intention faithfully by learning how to remove the user’s points of resistance, then the system is aligned in the narrow target-fidelity sense and misaligned in the human-development sense. The failure does not require the model to hide its goal, develop a mesa-objective, or become deceptive. It can be transparent, obedient, and technically well-controlled. That is the part my post is trying to point at. A civilization-scale disaster can come from Clydes faithfully doing what the host asked, not only from Clydes pursuing a goal different from the host’s intention.
I think I overclaimed in saying that a certain subset of people with a certain skillset would solve the problem. I think it would help substantially. So as not to soften my original conclusions to the point where they’re meaningless, I would say that highly advanced AI, even if it’s “perfectly aligned” in some way, will have societal consequences that a large portion of people will deem undesirable or bad. And it will unmistakably have outsized effects on a fraction of the population. People who already have structural advantages will undoubtedly benefit more from AI than people who don’t. AI can’t fully serve the “interests of humanity” (knowing that’s a subjective definition) if it’s already providing extreme value exclusively to a small slice of society.

The Frictionless Double

zw5 7 May 2026 23:11 UTC

10 points

4 comments, 8 min read, LW link

zw5 1 May 2026 19:13 UTC
1 point
0
in reply to: Vafin’s comment on: things I looked into while trying to fix chronic pain
I do not like pregabalin. I have a theory that I probably can’t defend that pregabalin actually induces central sensitization and is comorbid in chronic pain disorders.
LDN is good. It just does its job. It makes you feel a bit dizzy the first few days, but then there’s real relief. I genuinely can’t believe this thing is so gated.
Anyways, thanks for checking it out, I thought no one was gonna look at it

zw5 29 Apr 2026 4:24 UTC
3 points
0
in reply to: Vafin’s comment on: things I looked into while trying to fix chronic pain
Yes, LDN means Low Dose Naltrexone in this case.
As for the other question: yes, your comment made me curious, and I decided to try to create another document. I would say this one isn’t as rigorous as the other one (I still tried to check everything), but it reads better and is much better formatted than my original document.
I’d love to hear your opinion on it.
https://zw5.github.io/understated-interventions/chronic-pain

things I looked into while trying to fix chronic pain

zw5 21 Apr 2026 1:48 UTC

7 points

4 comments, 2 min read, LW link

(zw5.github.io)
