Part of my own reconciliation is to question the premise that they would already be capable of ushering in a new industrial revolution. I’ve become more skeptical over time as these basic reasoning issues persist. It’s hard for me to imagine an industrial revolution’s worth of progress and innovation powered by a mind so lacking in coherent world models across so many domains.
aysja
I do think it deserves to be called quiet. For instance, it seems like they waited until the peak of the news cycle about their conflict with the US government to release this update, and I suspect that was intentional, and also that this worked. In the same week they dropped their core safety commitments, Anthropic was mostly hailed as a hero for standing up to the government; they got almost entirely good press.
But also, Holden’s post explaining the decision is about as understated as a post like that could be. He tried to frame it as something closer to “just another update,” and it was not even the central focus of the post (which I really think it ought to have been, given the gravity of it). The fact that Anthropic was reneging on the core promise of their RSP was systematically downplayed, as it has continued to be by many Anthropic employees who maintain that dropping all “if-thens” from their if-then framework does not meaningfully constitute violating it.
It probably is the case that most of the facts cited in the post came from LLM-generated research, and it’s true that I have no idea how many of them were checked against primary sources. This does not seem like a difference in kind from a post where most of the cited facts came from a variety of NYTimes articles.
I think there is a lot more reason to trust the facts cited in an NYT article. For one, the New York Times, along with most major news publications, has standards for fact checking. They try hard to get primary source validation, or at least secondary source validation (some of those guidelines are stated here); falsifying information is a fireable offense. They also have a reputation to uphold, a major part of which rests on their ability to convey the news truthfully. These kinds of checks don’t really exist for LLMs.
Nor do we have much insight into how LLM information is generated. With news publications, we can at least understand the sorts of biases which might be introduced via the mechanisms under which stories are produced: people interviewing a bunch of people, maybe in misleading ways, leaving out some facts, etc. With LLMs, we have much less of an idea of what kind of errors might emerge, and hence what to mentally correct for, since we don’t understand the process that generates their outputs.

if anything I’d expect current frontier LLMs to be slightly safer to rely on in this way (maybe more hallucinations, but less actively adversarial “technically true but misleading” framings).
Perhaps this is just a personal difference, but I would much rather take “technically true but misleading” over “totally wrong but subtle enough and authoritative enough and seems-kind-of-right enough that you can barely notice unless you really dig into the claims or already have extensive background knowledge.”
I do not think you believe the correct update to have made, upon reading that section of the post, was “no update”.
My response upon reading that LLMs did substantial research or writing for a post is generally to not make any update. That doesn’t mean parts of it aren’t right, they likely are, it just means that it takes a ton of work for me to suss out what’s true (much more than for a human post, for reasons that Gwern outlined above), and it’s usually not worth it.
I’ve had a similar experience in trying to have research discussions with LLMs. Every time I poke at my own conceptual confusion on a topic they just seem to kind of break down: saying inconsistent stuff in loops, retreating back to what has already been said on the topic. They’re even worse than this, since they also often get really basic stuff wrong. E.g., just the other day Claude told me that the k-complexity of a random string was the same as that of a crystal. This was in the context of a probably confusing conversation for it, where I was trying to more deeply grok, and so really push on, the confusions around complexity measures; still, it’s pretty revealing (imo) that this happens. Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
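(As a rough aside on why that counts as a basic error; this is my own gloss, not something from the conversation.) Kolmogorov complexity measures the length of the shortest program that produces a string, so a crystal-like, periodic string is highly compressible while a typical random string is not:

```latex
% Informal sketch: a periodic ("crystal-like") string of n bits can be produced
% by a short program plus a counter, while most random n-bit strings have no
% description much shorter than themselves.
K(\underbrace{0101\cdots01}_{n\ \text{bits}}) = O(\log n)
\qquad\text{whereas}\qquad
K(x) \ge n - c \ \text{ for most } x \in \{0,1\}^n .
```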
But I’m always wondering if it is me who is crazy here, as my social environment seems to believe that LLMs are formidable forces of intellect, getting better by the year. My own sense-making of this situation is similar to Jeremy’s: it does seem like something is getting better, just something more along the lines of ~”filling in between the lines of what is already known” and less “raw intelligence,” whatever that is. But it’s of course impossible to talk about any of these things or to even really know what the difference is and so on, and in hearing more and more hype about LLMs getting better at coding, and not being much of a coder myself, I have been worrying that my own experience isn’t very representative. Maybe you can just get excellence out if you train super hard on a given domain, I don’t know. But also, maybe people are pointing at the same sort of thing when they say LLMs are “good at coding” as they are when they say they are getting smarter. So it’s an interesting data point for me, to see Jeremy describe it as such here.
Also, from We’re Not Ready: thoughts on “pausing” and responsible scaling policies:
I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This raises the possibility of a much broader consensus around conditional pausing than I think is likely around immediate (unconditional) pausing. And with a broader consensus, I expect an easier time getting well-designed, well-enforced regulation.
I think RSPs represent an opportunity for wide consensus that pausing under certain conditions would be good, and this seems like it would be an extremely valuable thing to establish.
First and foremost, in my mind, [this revision] is about learning from design flaws and making improvements. I always thought of the original RSP as a “v1” that would be iterated on
It seems pretty misleading to describe the shift away from unilateral pausing as a natural extension of the RSP being a living document. Of course people expected some changes to occur, but I think these changes were understandably expected to be more of the type “adjusting mitigations as they learned more” and less of the type “removing if-then commitments entirely.” Indeed, the “if-then” structure was the core safety motivation for RSPs as many understood it—in particular the idea that Anthropic would pause if danger exceeded some threshold—and it was heavily defended by Anthropic employees as such. I highly doubt most people would have predicted that Anthropic would drop this commitment later; I think doing so marks the breaking of a meaningful promise—something many people were relying on, making career decisions on the basis of, etc.
This new policy also seems even less safe to me. For instance, you describe some of the “wrong incentives” the RSP produced:
We knew that if we declared a model to cross the CBRN-4 or AI R&D-5 line, this could be extremely damaging to the company (in that our RSP would then require a unilateral pause or slowdown in AI development and deployment), while having little discernible public benefit… It seemed to me that there was an enormous amount of pressure to declare our systems to lack relevant capabilities
In other words, because a pause would seriously damage the company, there was pressure to misrepresent the risk. I think this should seriously call Anthropic’s ability to self-govern into question, yet Anthropic’s response is to commit themselves to even less. That is, the solution Anthropic is adopting to reduce such pressure, as I understand it, is simply to remove the consequences: since the threat of pausing is gone, there is little incentive to pretend models are safer than they are. But Anthropic will continue to confront this pressure, since the tension between company success and safety concerns is only going to grow as AI becomes more powerful. And without the commitment to pause, Anthropic is free to deploy really unsafe models!
Likewise, the reason you give that pausing would have “little discernible public benefit” seems to be because measuring the risk turned out to be too difficult, such that strategies like “sounding the alarm” are less likely to work:
The problem with this is the “grey zone” between not being able to make a good case that risks are low and being able to make a broadly compelling case that risks are high. This is a potentially vast gap… I don’t think a unilateral slowdown would necessarily be effective in such a situation… and [would] mostly be seen as crying wolf.
That is, since Anthropic cannot make strong guarantees about the risk one way or the other, it will be hard to get the world to rally around such equivocal evidence. But this is really bad! If it’s currently impossible to measure the risk well enough to get any kind of scientific consensus about its magnitude, it doesn’t just make global coordination hard, it makes it hard for any AI company to act safely, since they simply do not know what will or won’t be destructive. Yet Anthropic’s response to realizing the difficulty of defining red lines (and being able to tell whether they’ve been crossed) seems to be to do away with red lines altogether!
Which is to say that the situation as you’ve presented it seems strictly worse relative to the one Anthropic was imagining two years ago: we’re closer to AGI, but we have much less hope of accurately assessing the risk, and the political landscape is less favorable. Yet it seems like your proposal, in response to an overall more dangerous situation, is to be even more reckless. Instead of taking the ethical stance clearly outlined in the first RSP—“just because other language models pose a catastrophic risk does not mean it is acceptable for ours to”—Anthropic now only promises to be as safe as the least safe company. In light of this, there seems to be no commensurate push to try to shift political will toward coordinated pauses, no signing of public statements to the effect that Anthropic would pause, if everyone else did, or similar. As far as I can tell, the rationale for this centrally rests on fatalism:
Perhaps [slowing down would be a good idea], if we could be assured that the rest of the AI ecosystem would behave similarly. But I don’t think that’s plausible.
In other words, the trajectory to AGI cannot be much influenced by Anthropic’s actions, as people are going to race toward it regardless. But my god, does a post which is fundamentally premised on the inevitability of this race do so little to grapple with it. Not once does this post mention the possibility of extinction, for example, as if the real stakes and the real casualties Anthropic might cause have been forgotten. Very little attention is given to whether the race to AGI is in fact inevitable, or if there might be something Anthropic—as a leading player in this race (!)—might be able to do about that. Nor is any mention made of the role Anthropic has played in shaping this unfortunate political landscape which they now report being so helplessly beholden to. What is the point of having a seat at the table, if one doesn’t use it to wield influence in situations like this?
It’s not just any blog post. It’s a blog post outlining a major new strategic shift in the company, specifically in the direction of giving Anthropic far more leeway over how they decide what the risk is and how to deal with it. It seems especially important to state “we can’t leave this up to the companies” loudly and clearly here.
Trying to avoid self-deception seems like an important piece of it (although it seems non-trivial; e.g., it’s easy to self-deceive about one’s own level of self-deception). But for high-variance, high-impact stuff it separately seems especially important to try to take actions which are good over as many worlds as possible. Consequentialism doesn’t necessarily do this, since single factors can dominate the calculus. That causes optimizer’s curse problems, but more generally: in highly uncertain domains, probability estimates are just really often wrong. And especially when such a misstep can cause massive harm, I think it’s also worth trying to compensate for the uncertainty in the direction of being more robust to those errors.
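To make the optimizer’s curse point concrete, here is a minimal simulation sketch (my own toy example with arbitrary parameters, not anything from the post): when you pick whichever option has the highest estimated value, that winning estimate is, on average, biased upward relative to the option’s true value.

```python
import random

# Toy illustration of the optimizer's curse (assumed parameters, not from the post):
# true values and estimation noise are both Gaussian; we pick the option with the
# highest *estimate* and measure how much that estimate overstates the option's
# true value on average.
random.seed(0)

def average_overestimate(n_options=10, noise=1.0, n_trials=10_000):
    total_bias = 0.0
    for _ in range(n_trials):
        true_values = [random.gauss(0, 1) for _ in range(n_options)]
        estimates = [v + random.gauss(0, noise) for v in true_values]
        best = max(range(n_options), key=lambda i: estimates[i])
        total_bias += estimates[best] - true_values[best]  # chosen option's optimism
    return total_bias / n_trials

print(f"Average overestimate of the chosen option: {average_overestimate():.2f}")
```

Each individual estimate is unbiased; the upward bias appears only because selection favors options whose noise happened to be positive, which is one way of cashing out how a single dominant factor in the calculus can mislead.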
I love the idea behind this post, and I’ve come back to it many times over the years. Now I’m coming back to it with error correction on my mind.
I agree that in a game of telephone only information which is preserved through each step will make it to the end. But many processes are unlike this. E.g., normally if you didn’t hear something I said, you can ask. And one could imagine arbitrarily long gaps—you ask me to clarify in a week or a month or what have you. Or maybe someone has an insight and then loses it, gains it again, and then writes a book such that the information is pretty reliably transmitted in the world.
In all such cases there will be a few to many layers of the nested Markov structure which lose the information but gain it again at a later point. Which is to say that the constraint of needing each intermediate state to carry the information in order for the information to be preserved seems too strong. “Ability to recover the information” seems like a necessary constraint, although not sufficient: you need some kind of error corrector or the original generative process to run again in order to in fact recover it.
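As a toy illustration of that last point (my own sketch with arbitrary numbers, not something from the original discussion): a chain of noisy relays can corrupt individual copies of a bit, so some intermediate states fail to carry the information, yet an explicit error corrector (majority vote over redundant copies) recovers it downstream far more reliably than a single copy survives on its own.

```python
import random

# Toy sketch (assumed numbers): a bit passes through several noisy relay steps.
# Individual copies get flipped along the way, but majority vote over redundant
# copies acts as an error corrector and recovers the original bit far more
# often than a single unprotected copy survives.

def run_chain(bit, n_copies=5, n_steps=3, flip_prob=0.05):
    copies = [bit] * n_copies
    for _ in range(n_steps):
        copies = [b ^ 1 if random.random() < flip_prob else b for b in copies]
    return copies

def majority_vote(copies):
    return 1 if sum(copies) > len(copies) // 2 else 0

trials = 10_000
single_copy_ok = sum(run_chain(1, n_copies=1)[0] == 1 for _ in range(trials))
corrected_ok = sum(majority_vote(run_chain(1)) == 1 for _ in range(trials))
print(f"single copy survives the chain: {single_copy_ok / trials:.1%}")
print(f"recovered via majority vote:    {corrected_ok / trials:.1%}")
```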
I wrote the first draft of this essay around a year ago, in between the bouts of delirium that long covid was beginning to deliver me. And I couldn’t quite tell back then how real it was, and as long covid consumed more of my mind it drifted further away. It began to feel impossible that I had ever had, or could ever have, courage. Because courage requires capacity and I was losing all of mine. And the doubts grew larger, and the clarity dimmed, and I forgot about Frodo for a while, forgot about most everything, as I was left for many months staring directly into the bowels of deep atheism, wondering if I may ever be free from its merciless hold. And it really tested the fortitude of my soul, for there were moments when completely giving up felt the most natural, and really the only, option. But then somewhere in the grappling with this miserable new world I had come to inhabit I remembered Frodo again. And it was not instant, and it was not easy, but developing this concept of solemn courage did help my spirit recover.
I do not get to choose the world I am given. Reality is such that your mind can be randomly corrupted, some molecular demon etching away the grooves that were you until you are a nothingness. Reality is such that everyone I love will likely die. Some distant, plaintive conclusion accelerating into the present by that mysterious process so ravenously set in motion. And there really might not be much I can do about it, for all of my effort may just be drops against its tidal wave. God! Reality can be so unkind. Yet there is something powerful in the orientation of trying anyway. Because in the end that is all there is. In the end the stakes are what they are, and the situation is what it is, and all I can decide is what to do with what I am given. That’s really it, and accepting this has given me clarity. Yes, there will be days I cannot overcome illness; yes, I may not much affect the looming god machines; and yes, that is all very painful. But I’m not going to get lost in it. I’m going to look at it—the uncertainty and fear, the grip of disease and the overwhelmingly large and complicated threat to all I value—and then I’m going to try. Because it is important, and that is all I can do.
Solemn Courage
I agree that human values are more accretive like this, but I would also call those genes “terminal” in the same sense that I call some of my own goals “terminal.” E.g., I can usually ask myself why I’m taking a given action and my brain will give a reasonable answer: “because I want to finish this post,” “because I’m hungry,” whatever. And then I can keep double clicking on those: “I want to finish the post because I don’t think this crux has been spelled out very well yet” and I can keep going and going until at some point the answer is like “I don’t know, because it’s intrinsically beautiful?” and that’s around when I call the goal/preference “terminal.” Which is similar in structure to a story I imagine evolution might tell if it “asked itself” why some particular gene developed.
Perhaps “terminal” is the wrong word for this, but having a handle for these high-level, upstream nodes in my motivational complex has been helpful. And they do hold a special status, at least for me, because many of the “instrumental” actions (or subgoals) could be switched out while preserving this more nebulous desire to “understand” or “find beauty” or what have you. That feels like an important distinction that I want to keep while also agreeing they aren’t always cleanly demarcated as such. E.g., writing has both instrumental and terminal qualities to me, which can make it a more confusing goal-structure to orient to, but also as you say: more strange and wonderful, too.
I’m not sure if I expect motivated reasoning to come out better on average, even in domains where you might naively expect it to. In part that’s because self-serving strategies often involve doing things other people don’t like, e.g. being deceptive, manipulative, or generally unethical, in a way that can cause long-term harm to your reputation and so long-term harm to your ability to win. And I think there is significant optimization pressure on catching this kind of thing, in part for reasons similar to the ones outlined in Elephant in the Brain, i.e., that we evolved in an environment where winning that cat and mouse game was a big part of adaptive success. But also just because people don’t like being screwed, and so are on the lookout for this kind of behavior.
Also, in my imagination you’re more likely to win if you’re at least self-reflective about motivated cognition, since you can make more informed decisions that way. If you just go blindly ahead, then you’re probably failing to track a bunch of what matters, and so failing to win according to what you ultimately care about. Like, in most cases motivated reasoning spins up not just to convince other people, but to convince yourself, which means there’s a part of you that needed convincing in the first place, i.e., a part that is tracking and wanting different things. And I would guess that charging ahead without understanding those dynamics leads to worse outcomes overall? Another way to say it is that I don’t imagine a good rationalist as acting against their own interests, but more like they understand them clearly, such that they can decide what makes sense based on a fuller picture of their own mind.
Fwiw, my experience has been more varied. My most well received comments (100+ karma) are a mix of spending days getting a hard point right and spending minutes extemporaneously gesturing at stuff without much editing. But overall I think the trend points towards “more effort = more engagement and better received.” I have mostly attributed this to the standards and readership LessWrong has cultivated, which is why I feel excited to post here. It seems like one of the rare places on the internet where long, complex essays about the most fascinating and important topics are incentivized. My reddit posts are not nearly as well received, for instance. I haven’t posted as many essays yet, but I’ve spent a good deal of effort on all of them, and they’ve all done fairly well (according to karma, which ofc isn’t a great indicator of impact, but some measure of “popularity”).
I weakly guess that your hypothesis is right, here. I.e., that the posts you felt most excited about were exciting in part because they presented more interesting and so more difficult thinking and writing challenges. At least for me, tackling topics on the edge of my knowledge takes much more skill and much more time, and it is often a place where effort translates into “better” writing: clearer, more conceptually precise, more engaging, more cutting to the core of things, more of what Pinker is gesturing at. These posts would not be good were they pumped out in a day—not an artifact I’d be proud of, nor something that other people would see the beauty or the truth in. But the effortful version is worth it, i.e., I expect it to be more helpful for the world, more enduring, and more important, than if that effort had been factored out across a bunch of smaller, easier posts.
I haven’t followed every comment you’ve left on these sorts of discussions, but they often don’t include information or arguments I can evaluate. Which MIRI employees, and what did they actually say? Why do you think that working at Anthropic even in non-safety roles is a great way to contribute to AI safety? I understand there are limits to what you can share, but without that information these comments don’t amount to much more than you asking us to defer to your judgement. Which is a fine thing to do, I just wish it were more clearly stated as such.
This was a really important update for me. I remember being afraid of lots of things before I started publishing more publicly on the internet: how my intelligence would be perceived, if I’d make some obviously-stupid-in-retrospect point and my reputation would be ruined forever, etc. Then at some point in this thought loop I was like: wait, the most likely thing is just that no one reads this, right? More like a “huh” or a nothing at all rather than vitriolic hatred of my soul or whatever I was fearing. This was very liberating, and still is. I probably ended up over-optimizing for invisibility because of the freedom I feel from it—being mostly untethered from myopic social dynamics has been really helpful for my thinking and writing.
I tend to write in large tomes that take months or years to complete, so I suppose I disagree with you too. Not that intellectual progress must consist of this, obviously, but that it can mark an importantly different kind of intellectual progress from the sort downstream of continuous shipping.
In particular, I think shipping constantly often causes people to be too moored to social reception, risks killing butterfly ideas, screens off deeper thought, and forces premature legibility. Like, a lot of the time when I feel ready to publish something, there is some bramble I pass in my writing, some inkling of “Is that really true? What exactly do I mean there?” These often spin up worthy investigations of their own, but I probably would’ve failed to notice them were I more focused on getting things out.
Intellectual labor should aggregate minute-by-minute with revolutionary insights aggregating from hundreds of small changes.
This doesn’t necessarily seem in conflict with “long tomes which take months to write.” My intellectual labor consists of insights aggregating from hundreds of small changes afaict, I just make those changes in my own headspace, or in contact with one or two other minds. Indeed, I have tried getting feedback on my work in this fashion and it’s almost universally failed to be helpful—not because everyone is terrible, but because it’s really hard to get someone loaded enough to give me relevant feedback at all.
Another way to put it: this sort of serial iteration can happen without publishing often, or even at all. It’s possible to do it on your own, in which case the question is more about what kind of feedback is valuable, and how much it makes sense to push for legibility versus pursuing the interesting thread formatted in your mentalese. I don’t really see one as obviously better than the other in general, and I think that doing either blindly can be pretty costly, so I’m wary of it being advocated as such.
The first RSP was also pretty explicit about their willingness to unilaterally pause:
Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems.… Just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.
Which was reversed in the second:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold… such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards.
<3
I agree with you that something like the crystalized/fluid distinction is relevant here, and that current LLMs seem to have more of the former. But I’m also confused about where the fluidity ever comes from on this model. Like, I buy that armies of automated researchers which are better at doing everything than top human researchers could probably find a way to figure out how to build “human level fluid intelligence,” but I am confused about how you get to that step in the first place. Why are they better than human researchers at everything when they are still mostly using crystallized intelligence?