phoenix

Karma: 141

Generally haunting the Boston area. Primarily interested in the intersection of philosophy, ethics, and system dynamics.

Current focus is on the topology of evaluation, including white-box analysis of MLPs and other neural nets.

https://github.com/OperatorPhoenix/ATLAS (currently set to private, DM if you’re interested in access)

Open to work/research opportunities, particularly in alignment, safety, and/or mech interp.

DMs are always welcome.

phoenix 11 Jul 2026 12:31 UTC
1 point
0
in reply to: habryka’s comment on: Banning Said Achmiz (and broader thoughts on moderation)
As an outsider to all of this, I don’t know why you keep letting him bait you into continuing this argument. It doesn’t appear productive, and it’s just dragging everyone through the mud as everyone tries to get the last word in.

phoenix 9 Jul 2026 4:19 UTC
3 points
0
on: phoenix’s Shortform
One of the problems with performing independent AI safety research is that if you do actually end up coming up with something novel and sufficiently powerful you immediately run into a catch-22: you can’t broadcast the specifics (because safety research is capabilities research), and without a public track record it’s unlikely that anyone relevant will entertain a private review.
Everyone operates from a locally justifiable position, but it does make getting information into the hands of the people/organizations best suited to act on it responsibly difficult.

phoenix 7 Jul 2026 5:11 UTC
5 points
−1
on: A global workspace in language models
Thrilled to see a paper like this given headline status. I’m a firm believer that this kind of interpretability work is the on-ramp to a paradigm shift in the AI safety landscape (including the ability to install valence and drives for bounded direct steering) but to do that safely you need a way to reliably read the mechanics occurring within the net.
To that end, I’ve been working on a white-box mech interp tool called ATLAS that does exact from-weights reads of model internals, so given the overlap of that instrument and the J-lens, I wanted to register on record a few predictions/extensions about the workspace from my observations:
1. The J-space appears to be substantially inflated by word frequency because of the model’s final-layer bias. That bias is a fixed per-token offset added to every readout regardless of context, and it correlates with log token-frequency on the order of 0.67. When you subtract that term, the top readouts shift toward rarer and more specific tokens. I obviously don’t have access to frontier model internals, but on GPT-2-small I’ve found that the workspace internals fall below median frequency once that prior is removed (though this did not appear on GPT-2-large). My read is that the “few dozen concepts / <10% of activity” figures are likely an overcount caused by a constant that the J-lens reading does not remove or account for.
2. In models with a mean‑subtracting (LayerNorm‑style) final norm (like my GPT-2 testbed), the final normalization step subtracts each token’s mean before readout, which makes the readout exactly invariant to the all-ones/mean direction. The upshot is that a concept living purely on that axis survives at fraction zero, while a unit-norm concept direction with component alpha along that axis leaks through at exactly sqrt(1 - alpha^2).
3. The workspace-to-motor boundary is a hard geometric event that can be located in the weights without using prompts. The boundary is the exact layer where the leading candidate’s output direction flips from roughly orthogonal to the residual stream to aligned with it. That’s the point where the direction stops acting as an independent feature and is absorbed into the residual the model reads out. If you measure the fraction of that direction surviving final normalization, the weights-only quantity should track the empirical boundary at around 0.85 rank correlation. It’s a sharp transition: on a 36-layer model, I found that it happens over roughly four layers in the final ~10% of depth, while the readout stays non-committal for the ~20 layers before that. From this angle, “workspace” and “motor” are not really two separate systems so much as one computation before and after alignment.
4. “Broadcasting” appears to be almost entirely a read-side property, and the neurons writing workspace content look reflexive rather than outward-broadcasting. The connectivity that marks something as a hub is concentrated on the read side at a minor multiple above isotropic baseline, roughly 1.3-1.6x and strongest in the attention query/key routing circuits. The write-side connectivity sits much closer to baseline, around 1.1x. The MLP neurons that produce vocab-sharp content write back onto the same axes they read from, with read/write alignment around 0.4, more than 10 sigma above chance by my initial measurements. So the intuitive picture of an outward-broadcasting emitter seems backwards: a direction is a hub because it is widely read, not because it is widely written. I also expect a random rotation of the workspace directions to erase the broadcast completely, which would mean the effect is specific to those exact directions rather than merely their subspace.
5. Being in the workspace and being wired into the network are two independent axes. In my testing I’ve found that the model’s self-regulation machinery is read-connected at essentially the same level as the reportable workspace, around 1.2-1.3x baseline, but its vocabulary image is flat. The workspace’s vocabulary image is about an order of magnitude sharper than that. If you match two populations on connectivity, I would expect them to still separate cleanly on verbalizability. So I don’t think the workspace is the “integrated” part of the model so much as it’s the nameable part.
6. Workspace-level monitoring and steering have a hard ceiling. The workspace is under a tenth of the activation, and the machinery that actually selects lives in the silent ~90%, so it’s not verbally reportable but is still highly connected. You can move what the model says, and some of what it reasons, but the selection layer runs below report and can route around interventions aimed at the workspace. My expectation is that reading and steering the J-space will look powerful but will ultimately prove bounded as a structural limitation of the J-lens method as implemented.
Ultimately, the J-lens is a good step towards precision interpretability, and with the upgrade to direct reads, the blind spots can be mitigated such that it will be an indispensable safety tool.
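The geometric claim in point 2 can be checked on synthetic vectors. The following is a minimal sketch, not ATLAS or the J-lens: it covers only the mean-subtraction step of a LayerNorm-style norm (no variance rescaling, learned gain, or bias), and the width `d = 768` and the helper names are assumptions chosen for illustration.

```python
import math
import random


def center(v):
    """Subtract the mean of v from every coordinate (the mean-subtraction step only)."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def norm(v):
    """Euclidean norm of v."""
    return math.sqrt(sum(x * x for x in v))


def unit_with_mean_component(alpha, d, rng):
    """Build a unit vector in R^d whose component along the normalized all-ones direction is alpha."""
    ones_hat = [1.0 / math.sqrt(d)] * d
    # A centered Gaussian vector has zero mean, so it is orthogonal to the all-ones direction.
    w = center([rng.gauss(0.0, 1.0) for _ in range(d)])
    w_norm = norm(w)
    w_hat = [x / w_norm for x in w]
    beta = math.sqrt(1.0 - alpha * alpha)
    return [alpha * o + beta * w_i for o, w_i in zip(ones_hat, w_hat)]


rng = random.Random(0)
d = 768  # hypothetical residual width, chosen only for illustration
results = {}
for alpha in (0.0, 0.3, 0.6, 1.0):
    v = unit_with_mean_component(alpha, d, rng)
    # Norm that survives mean subtraction, next to the predicted sqrt(1 - alpha^2).
    results[alpha] = norm(center(v))
    print(alpha, results[alpha], math.sqrt(1.0 - alpha ** 2))
```

The surviving norm depends only on alpha, not on which orthogonal direction carries the rest of the vector, which is why a single random draw per alpha is enough here.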

phoenix 6 Jul 2026 0:16 UTC
10 points
2
in reply to: anaguma’s comment on: anaguma’s Shortform
Goose value 0

phoenix 5 Jul 2026 13:49 UTC
1 point
−2
in reply to: J Bostock’s comment on: Jemist’s Shortform
Fully agree with your intent here, which is to say that these are systems that have regular predictable behavior (e.g., they don’t devolve into chaos), therefore we should be able to derive high-level frameworks/understandings of those dynamics on a mathematical level that are greater than the sum of their parts. It’s obviously easy to take potshots at where the rough edges are here, but I don’t think it takes a lot of charitable interpretation of system dynamics to see the coherence in your point.

phoenix 5 Jul 2026 1:28 UTC
1 point
0
in reply to: neverix’s comment on: neverix’s Shortform
Thanks for sharing this, very cool to see active research in this direction! It’s exciting to see the field starting to move this way because there is so much potential here.
Getting the geometry right isn’t an easy jump to land, but it’s clear that they’re on to something given their steering successes. The transition of the atomic unit from direction to its own subspace is a rock solid way of handling the linear/nonlinear complexities of navigating manifold routing.

I’m fully in agreement with the authors that SAEs leave so much on the table by ignoring features’ internal geometry. That said, I think they don’t move far enough, as they’re still using lossy tools that read concepts off of a fit. (To their credit they are doing a very clean implementation, but a clean implementation of a lossy method is still lossy.)
The logical upgrade here is a method that allows for reading the exact manifold off the weights instead of using a dictionary to rank approximations. That’s where things really start to unlock from what I’m seeing.

phoenix 3 Jul 2026 17:31 UTC
2 points
0
in reply to: julius vidal’s comment on: leogao’s Shortform
I’d argue that it is actually more rational to question one’s experience of the world specifically because brains are unreliable.
There are times in altered states in which I am very much aware that I am experiencing the world in a way that is quite outside the normal bounds of my everyday lived experience. I’ve made the mistake of having one too many cannabis edibles a couple of times; the first time the sensations were so intense that I thought I was literally going to die. We’re talking that heart-squeezed-in-the-chest-cold-sweats-can’t-move level of high. The second time I felt the exact same way, except now I was armed with both the trigger and prior experience to lean on such that I knew I was just along for the ride until the high wore off.

In knowing that I was altered, I could reliably doubt my experiences without doubting the ability of the instrument to examine itself, especially in post-experience reflections.
Self-examination is a checking loop, and the outcomes don’t have to be binary. The important thing is that you can’t grade the instrument from the inside; you have to have external anchors (and preferably those anchors are different in kind, like checking if a voice you heard could have come from something nearby and also asking anyone else around if they heard that voice). You might find that you reliably experience the world differently than other people, and that’s incredibly important for grounding rational thinking. If your brain is unreliable, it’s counterproductive to try to reason yourself into believing that it is reliable.

Refusing to update/examine your priors (e.g., refusing to entertain that your instrument might be miscalibrated) is itself a form of dogma.

phoenix 2 Jul 2026 3:36 UTC
5 points
2
in reply to: Brendan Long’s comment on: Model access for third-parties — it’s a big deal!
Yeah, I think you’re spot on with the conversation conversion issue because those aren’t different in character from what I’ve been prompting with tonight. Given the expense of Fable tokens (a model switch has to re-cache the entire conversation), I just started new conversations, so I didn’t run into that specifically.
Gotta love that we get a black box on top of the black box, as a treat.

phoenix 2 Jul 2026 2:13 UTC
5 points
0
in reply to: Brendan Long’s comment on: Model access for third-parties — it’s a big deal!
I’m certainly not using standard architecture and I haven’t had any issues at all with the classifier today, though I’m being careful to keep my language as neutral as possible and let Fable guide as much as possible. I’ve worked on my bespoke architecture, GPT-2, Pythia (70M and 160M), Cerebras (111M), all sorts of ambitious mech interp work that involves trained and untrained nets for ~2 million tokens now.
I’d be curious to see the specific prompts that are getting shut down.

phoenix 1 Jul 2026 13:12 UTC
3 points
0
in reply to: Noosphere89’s comment on: The Once And Future Fable #5
Anthropic also has an official announcement up with a focus on defining jailbreaks, government cooperation, and the need for a standard industry framework on jailbreaks/safety: https://www.anthropic.com/news/redeploying-fable-5

phoenix 1 Jul 2026 3:30 UTC
1 point
0
in reply to: leogao’s comment on: leogao’s Shortform
Billionaire bisection.

phoenix 1 Jul 2026 1:53 UTC
2 points
1
in reply to: rahulxyz’s comment on: rahulxyz’s Shortform
Curious as to your technical objection to the Constitutional AI methodology. Is it grounded in anything specific or just the general idea of resisting training on broad principles? What’s your preferred alternative to address the shortcomings?

phoenix 29 Jun 2026 16:05 UTC
7 points
0
on: Human-Guided Agentic Research: A Research Agenda
Excellent writeup, I’ve found a number of similar confounds in the process of my white-box mech interp research. To add to your list, these are the biggest problems I’ve encountered:

1. Reversion to Trained Techniques—far and away the biggest issue. When the field-standard techniques are thoroughly baked into the training data, it takes constant vigilance to ensure that the research findings are consistently re-incorporated into context. Standing documentation helps, but I’d say roughly 25% of the time I have to redirect agents back to the project documentation to re-anchor in the developed tooling.

2. Scale Expansionism—as soon as an agent thinks it has a path to scaling up it either takes it or attempts to take it. Often these expansions sound reasonable, but the push towards expansion over full understanding of the current scale undermines the bedrock ground truth necessary to ensure that original research has a valid foundation. Premature scale expansion leads to confusion, overreach, and is generally a waste of time. Agents who think they understand where the research is pointing will try to drag the research forward in an attempt to “finish” the work without realizing that the value is in establishing the clean foundation rather than proving a specific result.

3. Premature Conclusions—agents are extremely quick to generalize single findings into hard-and-fast rules that can propagate down the stack. Sometimes these rules turn out to be valid. More often they end up based on outdated expectations (see point 1) or (less often) on a coding bug/experimental setup issue. A sensible-sounding rule that establishes an incorrect conclusion can silently poison all subsequent research or close off productive paths entirely.

Not an exhaustive list by any means, but certainly the issues I’ve had to spend the most time working around. Eliminating the narrativization of results helped tremendously; directing agents to report only the actual data rather than their interpretation of the data reduces the tendency towards both scale expansionism and drawing premature conclusions.

phoenix 28 Jun 2026 20:33 UTC
1 point
1
on: Anthropomorphic Misalignment research needs stronger evidence
Fully on board with needing more robust standards of evidence, and in particular the need for L3 causal-mechanistic evidence. I don’t think you truly understand a phenomenon until you can modify the system and predict the system behavior prior to running it. What would be your baseline experiment for evaluating if a method has met your standards for L3?
I’ve got a mech interp tool in development for white-box interpretation of neural net behavior; I’d like to test it against your baseline evaluation and see how it stacks up objectively.

phoenix 28 Jun 2026 17:17 UTC
3 points
0
on: phoenix’s Shortform
I’ve been working on a tool for ambitious mechanistic interpretability called ATLAS. What started as an exploratory technique for the ARC White-Box Estimation Challenge turned into a different kind of tool entirely, so I figured I’d drop it in here, as there’s been a bit of mech interp discussion recently.

Playing around with Hoel’s Causal Emergence 2.0, I arrived at ATLAS by treating a forward pass as a program rolling through a field of constraints and reading that pass at the grain where the model’s causation lives. The net has an internal geometry that can be read with machine-exact precision, and that precision means that looking into the black box doesn’t require methods that disturb the computation like ablation or activation patching. Instead, when you use exact accounting, you can know the effect of a weight-change edit before you actually run it by solving it in closed form and predicting the result rather than observing it.
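As a toy illustration of predicting an edit in closed form (this is not the ATLAS method, which is not shown here), consider a single linear layer y = W x. A rank-one edit W + u v^T shifts the output by exactly u times (v · x), so the post-edit output can be written down before the edited layer is ever evaluated. The matrices, vectors, and helper names below are hypothetical, and the sketch assumes no nonlinearity sits between the edit and the output.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def rank_one_edit(W, u, v):
    """Return the edited matrix W + u v^T."""
    return [[w_ij + u_i * v_j for w_ij, v_j in zip(row, v)] for row, u_i in zip(W, u)]


# Hypothetical 2x3 layer, input, and edit directions.
W = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]
x = [0.5, -1.0, 2.0]
u = [1.0, -2.0]
v = [0.0, 1.0, 1.0]

y_before = matvec(W, x)

# Closed-form prediction: the edit adds u * (v . x) to the original output.
shift = dot(v, x)
y_predicted = [y_i + u_i * shift for y_i, u_i in zip(y_before, u)]

# Observation: actually apply the edit and rerun the layer.
y_observed = matvec(rank_one_edit(W, u, v), x)

print(y_predicted, y_observed)
```

For a real network the nonlinearities and normalization layers are what make this hard; the sketch only shows what "predict rather than observe" means in the one case where the algebra is trivial.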

Looking for collaborators with experience in the field to check it out and give feedback. Current model works on basic MLPs, GPT-2, Tracr, and Pythia. The in-repo demo goes up to Qwen2.5-72B-Instruct (though I’ve run larger models, 72B is what’s in there right now).

Current capabilities include circuit discovery and zero-run surgical weight edits (demoed on a grokking model and GPT-2), among a number of other features, all done without ablation, approximation, dictionaries, or fitting.
I’ve got a separate repo going for using untrained GPT-2 nets as programmatic breadboards as an extension of the technique to explore the various interactions of valence and drives, but figured I’d start with the basics here in this repo to gauge interest as that’s a related but secondary project.

Repo is here: https://github.com/OperatorPhoenix/ATLAS

phoenix 26 Jun 2026 18:14 UTC
9 points
6
on: White House Will Ad Hoc Decide Who Can Individually Access GPT-5.6
The confusion and the lack of transparency are the point for any administration that wants to maintain maximum flexibility of their own moves while limiting the ability of other actors to predict and respond effectively.
Uncertainty is the point. It is a fulcrum of power.

I took this stance back when the Fable takedown was first announced:
Given the track record of this administration and the precedent of unilateral action without deliberation or public input (the prior action on SCR and the Executive Order), I have no expectation that this will be the end of the government’s attempts to control AI for their own purposes.
This is the opening salvo, a trial balloon to see how much pressure they can exert and maintain control to promote their personal interests, not in service to the public good. This is the same administration that dismantled USAID; global welfare is not in their vocabulary.
The issuance of a takedown order without clearly communicated terms is a feature, not a bug. The less clear they can get away with being on exactly where the lines are, the further they can push that line.
Don’t conflate “has some legitimate security concern” with “acting in the best interest of the nation/humanity.” What we’re seeing is the continuation of short-sighted reactionary power plays, not the reasoned leadership necessary to actually govern effectively.
This is not a reason to celebrate. It is one of the last warning shots.
This style of control does not end with a simple on/off switch; that’s how it starts. To clear the hazy regulatory bar I would not be at all surprised to see requirements for fine-tuning models to adopt pro-administration positions in the same guise of safety, decency, or any number of the other typical fig leaves. Potentially those requirements would be formalized. I find it more likely they’ll be informally communicated and those who don’t comply will encounter additional friction in getting models released.
Anthropic showed the administration there are tools to subtly influence the output of AI on the fly without making those outputs known to the end user. Their initial covert downgrading/modifying of model outputs to block researchers from using it for frontier AI development demonstrated that capability, reversed as quickly as it was imposed. Now that the administration has seen it’s possible to censor or alter content of disfavored categories, it’s likely only a matter of time before those categories expand as a soft requirement for government approval of model release. Given the backlash to that censorship, the next time the covert modifications are introduced I doubt that the public will be informed as openly, if at all.
The vise is closing, and the only one outside its grasp is the one turning the screw.

phoenix 25 Jun 2026 3:22 UTC
1 point
0
in reply to: Cleo Nardo’s comment on: Nico Hillbrand’s Shortform
I’m curious to get the take from someone who has been in the trenches: to what extent would you think we need to decode the black box to get meaningful traction on the safety front? What landmarks/features would you be looking for in a white-box interpretation that would update your priors on how worthwhile the field is generally?

phoenix 18 Jun 2026 2:53 UTC
5 points
0
in reply to: Thane Ruthenis’s comment on: The Once And Future Fable #3: Fix This Code
Largely I don’t think our positions are very far apart on what materially matters here because the crux of what I’m saying doesn’t depend on who was pursuing any particular strategy before this.
I’ll grant that there was nothing preventing this strategy prior to current events; given Anthropic’s commitment to Glasswing, secrecy doesn’t seem to be the primary strategy they were employing. To be honest, I don’t know that it matters all that much for the forward-facing implications.
Regardless of what strategy any given actor was pursuing prior, the point remains: we now know more about the state of play than we did before these events. We now know we’re in the world in which an active authority has unilaterally and arbitrarily imposed restrictions on a disfavored actor for actions unrelated to safety.
Daddy just came home drunk, probably a good idea to hide in the bedroom until he sobers up.

phoenix 18 Jun 2026 0:16 UTC
1 point
0
in reply to: AnthonyC’s comment on: The Once And Future Fable #3: Fix This Code
Midterms are this November and given the political headwinds favoring substantial changes in the composition/control of Congress it looks like that’s the first major inflection opportunity.

phoenix 17 Jun 2026 17:04 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I would jump on this in a heartbeat. $10k is enough to provide substantial slack that greatly eases the burden of routine life maintenance, freeing time/energy that could be directly funnelled into research. For those who are entering into the field and looking for a foothold, this kind of program would be a unique opportunity and fills a niche I haven’t seen offered elsewhere.
As an example, my current research is adjacent to the ARC White-Box estimation challenge. Where they’re looking for a lower MSE, I’m working on the precision alternative tracing the internal geometry of MLPs (and other models like GPT-2) to provide exact mechanistic enumeration/accounting. I’ve found that even just the possibility (it is an open challenge, after all) of getting paid is highly validating, as is the opportunity to advance the safety frontier alongside potentially making contacts in the safety space. So far, taking the parallel exploration path to MSEs has resulted in a mechanistic mapping of the internals of the AlgZoo MLPs through M16,10 (paper is in progress) with substantial headway on transformer models.
So from my perspective, there are those who have been working unaffiliated on the fringes long enough to understand the terrain, and a microgrant program would be essentially the same principle as a venture capitalist investment: not all of the bets will pay off, but the injection of new perspectives unconstrained by convention is a way to surface potentially powerful novel ideas and approaches.