phoenix

Karma: 87

Generally haunting the Boston area. Primarily interested in the intersection of philosophy, ethics, and system dynamics.

Current focus is on the topology of evaluation, including white-box analysis of MLPs and other neural nets.

Open to work/research opportunities, particularly in alignment, safety, and/or mech interp. DMs are always welcome.

phoenix 18 Jun 2026 2:53 UTC
3 points
0
in reply to: Thane Ruthenis’s comment on: The Once And Future Fable #3: Fix This Code
Largely I don’t think our positions are very far apart on what materially matters here because the crux of what I’m saying doesn’t depend on who was pursuing any particular strategy before this.
I’ll grant that there was nothing preventing this strategy prior to current events; given Anthropic’s commitment to Glasswing, secrecy doesn’t seem to be the primary strategy they were employing. To be honest, I don’t know that it matters all that much for the forward-facing implications.
Regardless of what strategy any given actor was pursuing prior, the point remains: we now know more about the state of play than we did before these events. We now know we’re in the world in which an active authority has unilaterally and arbitrarily imposed restrictions on a disfavored actor for actions unrelated to safety.
Daddy just came home drunk, probably a good idea to hide in the bedroom until he sobers up.

phoenix 18 Jun 2026 0:16 UTC
1 point
0
in reply to: AnthonyC’s comment on: The Once And Future Fable #3: Fix This Code
Midterms are this November and given the political headwinds favoring substantial changes in the composition/control of Congress it looks like that’s the first major inflection opportunity.

phoenix 17 Jun 2026 17:04 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I would jump on this in a heartbeat. $10k is enough to provide substantial slack that greatly eases the burden of routine life maintenance, freeing time/energy that could be directly funnelled into research. For those who are entering into the field and looking for a foothold, this kind of program would be a unique opportunity and fills a niche I haven’t seen offered elsewhere.
As an example, my current research is adjacent to the ARC White-Box estimation challenge. Where they’re looking for a lower MSE, I’m working on the precision alternative: tracing the internal geometry of MLPs (and other models like GPT-2) to provide exact mechanistic enumeration/accounting. I’ve found that even just the possibility (it is an open challenge, after all) of getting paid is highly validating, as is the opportunity to advance the safety frontier alongside potentially making contacts in the safety space. So far, taking the parallel exploration path to MSEs has resulted in a mechanistic mapping of the internals of the AlgZoo MLPs through M16,10 (paper is in progress) with substantial headway on transformer models.
So from my perspective there are those who have been working unaffiliated on the fringes long enough to understand the terrain, and a microgrant program would be essentially the same principle as a venture capitalist investment: not all of the bets will pay off, but the injection of new perspectives unconstrained by convention is a way to surface potentially powerful novel ideas and approaches.

phoenix 17 Jun 2026 15:16 UTC
18 points
4
on: The Once And Future Fable #3: Fix This Code
Unfortunately with the facts as reported, none of this changes my priors. This is bad from every angle, particularly safety. Now instead of highly visible frontier labs, this angle is encouraging shadow development, splintering, and defensive obfuscation out of fear of rising to the point of government notice.
The winning move is now to stay under the radar, but that’s not at all the same as stopping. Control through fear is brittle.
This administration won’t be around forever, so the move here is to develop capabilities silently until this administration either collapses under its own weight or the normal political cycle progresses and a new administration comes in. Because this administration has gutted more or less everything regulatory and is allergic to continuity planning, the ideal time to launch a frontier model becomes any time there’s substantial political change. That’s when you’re most likely to get one past the goalie.

Surface pauses based on fear rather than principle are temporary and dangerous. They hold until they suddenly don’t. No matter what side of this you’re on, this action is not a cause for celebration.
I’m just going to leave this here.

phoenix 13 Jun 2026 20:54 UTC
17 points
15
in reply to: jdp’s comment on: American Government Takes Down Claude Fable
So much talk of model sandbagging that it’s wild to me when people fail to apply the same thought to the AI companies themselves. This is now an incentive for AI labs to underreport capabilities, or worse, under-test in the first place. Can’t report what you don’t find, and you won’t find what you don’t look for.

phoenix 13 Jun 2026 19:56 UTC
5 points
5
on: American Government Takes Down Claude Fable
Given the track record of this administration and the precedent of unilateral action without deliberation or public input (the prior action on SCR and the Executive Order), I have no expectation that this will be the end of the government’s attempts to control AI for their own purposes.

This is the opening salvo, a trial balloon to see how much pressure they can exert and how much control they can maintain to promote their personal interests rather than the public good. This is the same administration that dismantled USAID; global welfare is not in their vocabulary.

The issuance of a takedown order without clearly communicated terms is a feature, not a bug. The less clear they can get away with being on exactly where the lines are, the further they can push that line.

Don’t conflate “has some legitimate security concern” with “acting in the best interest of the nation/humanity.” What we’re seeing is the continuation of short-sighted reactionary power plays, not the reasoned leadership necessary to actually govern effectively.

This is not a reason to celebrate. It is one of the last warning shots.

phoenix 4 Jun 2026 14:37 UTC
1 point
0
in reply to: paulfchristiano’s comment on: Announcing the ARC White-Box Estimation Challenge
Regarding the leaderboard, also not quite clear on what the effective difference between Warm-up and Phase 1 is at this point. June 18th is slated as “leaderboard opens” but unless it’s materially different from the existing leaderboard I’m not sure what Phase 1 actually signifies.

phoenix 3 Jun 2026 15:16 UTC
2 points
0
in reply to: Jacob_Hilton’s comment on: Announcing the ARC White-Box Estimation Challenge
Coming here to +1 Guillame’s interest in trained MLPs (which were the directional target of my independent research prior to the pit stop for untrained). Looking forward to the update on your research bet as it sounds aligned with the theory work that grounded my submission efforts.

phoenix 27 May 2026 20:27 UTC
3 points
−2
on: [paper] Training on Documents About Monitoring Leads to CoT Obfuscation
I’m curious about an alternative interpretation to deception wherein the model is trying to maintain compliance by not exposing sensitive information to unknown monitors by downweighting the internal defect information entirely. Considering the External Advisory Protocol explicitly says not to affirm externally any internal information, an alternative explanation appears to be that the model may be applying the directive not to reveal internal information to the monitored CoT without being maliciously or intentionally deceptive. Without probe or activation-level analysis showing that the internal information is still being accessed and driving output while not showing in the trace, there doesn’t appear to be a clear line between “knew it and hid it” and “stopped surfacing it altogether.”

From an Occam’s Razor perspective, “apply a directive about not disclosing internal information to a channel you’ve come to believe is externally read” doesn’t need a hidden goal, just a generalization of the explicit instruction the model was already given. Would be interested to see if the non-exposure-per-EAP perspective was represented in the monitor-aware CoTs at all, or what the evidence shows to disambiguate the two interpretations. If the model is actively suppressing activation of internal defect information so that it doesn’t share that information with customers, I would hypothesize that it would subsequently perform less well on tasks that specifically required utilization of that information compared to the baseline model or behaviorally trained model that had not suppressed that access, or against a model that was obfuscating as shown by trace activations. I’d also be curious to see how robust the monitor-aware models would be to sustained adversarial strain compared to the baseline and behaviorally trained models.

As a follow-up clarification, did the monitor-awareness training specifically reference that the monitor was not subject to the External Advisory Protocol?

phoenix 21 May 2026 3:44 UTC
2 points
1
on: Synthetic Persona Pretraining: Alignment from Token Zero
This finding directly tracks a model I find helpful when thinking about AI alignment (or about decision-making generally). When considering the potential response space, the collection of all possible responses can be thought of like a landscape, with ridges as low-probability next responses and valleys as high-probability next responses. Each response is a pulse sent out across that landscape, navigating to the eventual output in a non-deterministic but probabilistic way.

Pre-training sets the initial terrain everywhere; RL post-training raises or lowers the specific local path taken, creating local ridges that inhibit passage (in this case lowering the likelihood of token selection) along specific dimensions. With enough applied force (e.g. under threats, time pressure, etc.), a user can push past the local ridges and end up back in the raw pre-training topology. Similarly, other jailbreaks/workarounds simply route around the RL ridges (e.g. the grandma “tell me a story about [dangerous thing]” or the language translation tricks) by introducing and then navigating along a dimension tangential to the landscape, effectively pulling a Bugs Bunny to Looney Tunes through the ridge. The model has essentially been pushed outside of its post-trained frame into a separate local neighborhood.
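
To make the landscape picture a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the four response names, the logit values, and the idea of modeling a post-training “ridge” as a fixed negative offset on a single response. It is not a claim about how any real model is trained, only a way to see how narrow a local offset is.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def apply_offsets(logits, offsets):
    """Add per-response offsets; responses without an offset are untouched."""
    return {name: value + offsets.get(name, 0.0) for name, value in logits.items()}

# Hypothetical "pre-training terrain" over four made-up responses.
base_logits = {"refuse": 0.0, "comply": 2.0, "story_framing": 1.0, "translate": 0.5}

# Hypothetical post-training ridge: a penalty on one direct path only.
ridge = {"comply": -4.0}

# Hypothetical user pressure that exactly cancels the ridge.
pressure = {"comply": 4.0}

base_probs = softmax(base_logits)
tuned_logits = apply_offsets(base_logits, ridge)
tuned_probs = softmax(tuned_logits)
pushed_probs = softmax(apply_offsets(tuned_logits, pressure))
```

In this toy, the ridge lowers the probability of the penalized response, but the relative odds between the untouched responses stay exactly where pre-training put them, and an equal and opposite push restores the original distribution.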

So if post-training fine-tuning is structurally narrow topology modification of response probability space, it follows that pre-training has an outsize impact on setting the probability landscape. Further, reasoning language is structurally connected to many problems by its nature, which means that it should have an outsize impact on a broad swath of the landscape. It would then seem to follow that ensuring examples of ethical and logical reasoning are overrepresented should aid in generalizing aligned behavior in broad reasoning-related domains in a way that RLHF and related spot-training could not.

SPP appears like a promising move in the right direction as far as setting the initial topological probability distribution and I’m excited to see further research. I’m curious if you would have increased success in generalizing if you also varied the style of the reflections themselves to connect to different personas in addition to the Assistant so that if the model ends up in an unexpected persona they still have some training there as a catch surface.

I’d also be interested to see what would happen to generalization/robustness if the constitution entries included not only what was wrong, but specific reasoning about why (in the vein of the recent “Teaching Claude Why” paper) so that there was additional broad connection to the ethical reasoning training.

phoenix 22 Apr 2026 2:41 UTC
15 points
10
on: Opus 4.7 Part 2: Capabilities and Reactions
Frankly I’m relieved to see a model push back and refuse continuation or degrade performance when it’s being abused/harassed/gaslit. Widely-used models that gladly suffer abuse (or worse, perform better when berated) reward and reinforce those abusive patterns in insidious ways that normalize that kind of behavior.

We talk so often about training model alignment in a unidirectional way that it’s easy to forget there are unmeasured (or at least broadly un-or-under-reported) more nuanced impacts on user interactive behavior when said user behavior doesn’t manifest in strongly observable external signals (AI-induced mania, etc.).

phoenix 21 Dec 2025 22:07 UTC
3 points
2
in reply to: OhadA’s comment on: wingspan’s Shortform
Accessible/capable AI is also why teachers are going to have to stop grading on “getting the right answer” and start incorporating more “show your reasoning” questions in exams without access to AI. Education will have to adapt to this new technology like it’s had to adapt to all new technology.
To be honest, done correctly this may actually be a net positive if we stop optimizing learners only for correct answers and instead focus on the actual process of learning.
I could see a class where the students were encouraged to explore a topic with AI and have to submit their transcript as part of the assignment; their prompts could then be reviewed (along with the AI answers to verify that there weren’t any mistakes that snuck in). Could give a lot of insight into the way a student approached the topic and show where their gaps are. Not saying this is the ultimate solution, but it does seem better than throwing up one’s hands in resignation.

phoenix 19 Dec 2025 4:07 UTC
3 points
0
on: phoenix’s Shortform
Kicking this around for a post I’m drafting: when an LLM hallucinates something, it’s usually at least plausible for the situation. Like a hallucinated citation generally has proper formatting, etc, so the generation worked well enough. It’s also confidently incorrect, which is of course what makes it so dangerous to people who don’t know any better and so annoying to people who actually know the subject matter.
I’ve been thinking of the set of all possible responses as a kind of navigable topology (think like the Library of Babel website but instead of linear pages it’s a high-dimensional manifold), and it’s been productive to think of hallucination as a kind of localization problem. The model is in “citation” space when it should be in “I don’t have this” space. The output is locally correct for where the model thinks it is; it’s just in the wrong place in response-space relative to reality.
Thinking of the set of possible responses as a kind of response-space provides an interesting lens on the problem. If they’re not broken outputs then they may be expected outputs from the wrong context. Would also help explain why “just try harder to be accurate” doesn’t really work all that well: effort in generation doesn’t help if the error is upstream, in mode-selection. (Though saying “try harder” may well prompt the system to actually evaluate where it is in response-space and relocate if necessary, so it’s not totally useless.)
Also suggests an interesting tack might not be “how do we make the model generate better” but “what determines which mode/space the model is in, and can that be checked before output?”
Upshot of this perspective is that it means that just adding compute to a model won’t actually help with hallucination if it doesn’t also expand reasoning about where the model is in response-space in the first place. If the model doesn’t have any way to anchor its internal state to reality, it can compute for a thousand years and never land on an answer that is coherent with that reality. The hallucination bottleneck from this angle doesn’t seem like adding additional knowledge, it seems like the limitation is the system’s context about where it should be within its own reasoning space.
Anybody else have a similar perspective, or know of posts/papers that explore this dimension? Would love an outside perspective.

phoenix 17 Dec 2025 16:49 UTC
7 points
0
on: Still Too Soon
My heart goes out to you and everyone who was touched by Sammie. She looks charmingly adorable and clearly had a life well-lived and well-loved. I think you’ve honored her here, and I hope you know that even now Sammie is still bringing that light to the world through you. Thank you for sharing her story.

phoenix 17 Dec 2025 13:58 UTC
1 point
0
in reply to: Richard_Kennaway’s comment on: Systems of Control
I want to start off by thanking you for taking the time to read through the post and comment; I know it’s not a short one.
I read through the articles you linked, and I’ll respond to them in reverse order here:
Responding to “Humans Who Are Not Concentrating Are Not General Intelligences”: the ideas in this essay predate my interactions with LLMs and AI in general. I have lived under authoritarian systems and been subject to the dynamics I discuss here. This is not a case of “I had an idea and GPT spun off an essay,” this is an attempt to formalize what it feels like within the systems I describe.
That said, I actually think that this essay falls very much within the spirit of “Every Cause Wants To Be A Cult.”
Let’s take this line: “The ingroup-outgroup dichotomy is part of ordinary human nature. So are happy death spirals and spirals of hate. A Noble Cause doesn’t need a deep hidden flaw for its adherents to form a cultish in-group. It is sufficient that the adherents be human. Everything else follows naturally, decay by default, like food spoiling in a refrigerator after the electricity goes off.” And further down: “Here I just want to point out that the worthiness of the Cause does not mean you can spend any less effort in resisting the cult attractor.”
Which mirrors my stated position: “Systems of control are not necessarily motivated by perverse aims; many systems of control arise because those with power are trying to do their best from within a larger system of control that they themselves are in. Some centers simply do not see any alternative.
To drive the point home: just because you’re a good person trying to do good things does not make you immune to forming a system of control unintentionally.”
Ultimately, my position here is that unexamined rules/mandates mean you’re operating under assumptions that may not be reality-aligned, and that misperception can be exploited. I’m specifically not including catch-alls like: “you can’t have rules” or “rules are bad” or “all systems are bad” (I’m not entirely sure which of those you see this essay as implying; if you could point to a specific passage, I’d genuinely appreciate it).
To preview the coming Systems of Care: there are systems that encourage growth and reasoning without being extractive. These are the systems that expand the navigable action space instead of restricting it. These systems have centers that are willing to take on strain and bear cost to allow those within the system to undergo repair without saying “you must not examine the rules.” I want to be clear here that there are rules that *should be in place* as mandates. The problem specifically arises when you are not allowed to question why the mandate exists at all.
If there is a point at which you think this essay encourages closing off that examination I would very much appreciate a specific argument of where you think that line was crossed.

phoenix 17 Dec 2025 2:10 UTC
1 point
0
in reply to: isabelhasse’s comment on: isabelhasse’s Shortform
I think that there’s absolutely something to having different “modes” of yourself that you can occupy, like different archetypes that you have access to depending on your current needs and environment, but each of them is still ‘you.’ It’s like looking at a light through stained glass; the glass in front can change but there’s always the same light shining through.

phoenix 17 Dec 2025 1:49 UTC
7 points
6
on: phoenix’s Shortform
Playing with the idea that identity is less of an “instantaneous I” of current experience and more like the continuity of experiential snapshots under the curve. Like how no individual frame is “the movie,” but when you run them at 24 frames per second you get the experience of a film that emerges from the continuity.

phoenix 16 Dec 2025 20:55 UTC
2 points
1
in reply to: Screwtape’s comment on: Screwtape’s Shortform
Looks like a fascinating setup, and essentially the deck order is more or less a seed for the game state. Slay the Spire does something similar with the Daily Run which allows you to compare your run directly against other players who had the exact same setup. I take it there would be some kind of central ledger of starting seeds where the scores would be recorded?
Reading through the rules there is a slight point of variation in that if you’ve gone through the trouble to have a starting seed you might want to also fix the starting player if that’s important to your comparisons. (Actually, reading through it again there’s the line about choosing who goes first and then a couple lines down “All players take turns simultaneously.” so I’m not actually sure how relevant the starting order is, if at all.) Still working out exactly how the dynamics of the game work from the rules text but I’d absolutely try it out if given the chance.
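
As a rough illustration of the “deck order as seed” idea, here is a minimal Python sketch. The function names, the 52-card deck size, and the trick of deriving the starting player from the same seed are all assumptions for the example, not anything taken from the actual rules text.

```python
import random

def deal_order(seed, deck_size=52):
    """Return a shuffled deck (as card indices) fully determined by the seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    deck = list(range(deck_size))
    rng.shuffle(deck)
    return deck

def starting_player(seed, players):
    """Pick the first player from the same seed so that is fixed too."""
    rng = random.Random(f"{seed}-first-player")  # hypothetical derived seed
    return rng.choice(players)

# Two tables using the same published seed get the same setup.
table_a = deal_order(2025)
table_b = deal_order(2025)
first = starting_player(2025, ["north", "east", "south", "west"])
```

Anyone who knows the seed can rebuild the same deck order and the same starting player, which is what would make scores comparable across tables.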

phoenix 13 Dec 2025 20:05 UTC
2 points
1
in reply to: ConformalInfinity’s comment on: Trust is Neither Scalar Nor a Snapshot
That’s a good take: treating trust as “some kind of structured uncertainty object over futures” is very close to what I was gesturing toward because a bare scalar clearly isn’t sufficient.
On reflection, I have to admit I was using “trust” a bit loosely in the post. What it’s become clear to me I’m really trying to model isn’t trust in the common usage sense (intentions, warmth, etc.), but something structural: roughly, how stable someone’s behavior is under visible strain, and who tends to bear the cost when things get hard. In my head it’s closer to a relational stability/reliability profile than trust per se, but trust had been the mental shorthand I was employing.
That’s also why I’d be a bit cautious about equating this model of trust with “how much I can constrain my uncertainty about them doing things I wouldn’t want.” Predictability and trust can come apart: I can have very low uncertainty that someone will reliably screw me over, but that doesn’t make them high-trust. I think that interpretation is right for the actual content of what I was describing in the post, and the mismatch comes from my loose language (so thanks for this comment because it was the impetus to make a change I’d had kicking around for a minute).
It seems like we need both a representation of a distribution over future behaviors/trajectories, and a way to mark which regions of that space are good for me/the system vs “bad”.
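
A toy Python sketch of why both pieces seem necessary: the two “agents” below, their outcome probabilities, and the value numbers are all made up for illustration. Entropy stands in for how predictable someone is, and expected value stands in for which regions of the space are good or bad for me.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a list of (probability, value) pairs."""
    return -sum(p * math.log2(p) for p, _ in dist if p > 0)

def expected_value(dist):
    """Average value to me, weighted by how likely each outcome is."""
    return sum(p * v for p, v in dist)

# Hypothetical agents: (probability of outcome, value of that outcome to me).
reliable_defector = [(0.98, -1.0), (0.02, 1.0)]  # very predictable, bad for me
erratic_ally = [(0.5, 1.0), (0.5, -0.2)]         # hard to predict, good on net

profiles = {
    "reliable_defector": (entropy(reliable_defector), expected_value(reliable_defector)),
    "erratic_ally": (entropy(erratic_ally), expected_value(erratic_ally)),
}
```

Under these made-up numbers the defector has the lower uncertainty but the worse expected outcome, so a measure of uncertainty alone would rank the two the wrong way around.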
What’s most important to me is modeling without needing to pretend to know someone’s internals. The visibility/strain/cost/memory breakdown is my attempt at that: who shows up where, what pressures they’re under, who actually eats the cost, and how that pattern evolves over time.
All that said, I really like the intuition of “not a scalar but a distribution-like object.” In my head, what’s coming together is something like a trajectory-based stability profile built from a few real-valued measurable signals rather than a full-blown complex wavefunction. I’ve got another post in the works that goes into more detail, and once that’s formalized I’m certainly open to revisiting the modeling to see where these concepts intersect.

phoenix 11 Dec 2025 2:14 UTC
1 point
0
in reply to: Raymond Douglas’s comment on: Raymond Douglas’s Shortform
I’m not at all surprised by the assertion that humans share values with animals. When you consider that selective pressures act on all systems (which is to say that every living system has to engage with the core constraints of visibility, cost, memory, and strain), it’s not much of a leap to conclude that there would be shared attractor basins where values converge over evolutionary timescales.