Generally haunting the Boston area. Primarily interested in the intersection of philosophy, ethics, and system dynamics.
Current focus is on the topology of evaluation.
Coming here to +1 Guillame’s interest in trained MLPs (which were the directional target of my independent research prior to the pit stop for untrained). Looking forward to the update on your research bet as it sounds aligned with the theory work that grounded my submission efforts.
I’m curious about an alternative interpretation to deception wherein the model is trying to maintain compliance by not exposing sensitive information to unknown monitors by downweighting the internal defect information entirely. Considering the External Advisory Protocol explicitly says not to affirm externally any internal information, an alternative explanation appears to be that the model may be applying the directive not to reveal internal information to the monitored CoT without being maliciously or intentionally deceptive. Without probe or activation-level analysis showing that the internal information is still being accessed and driving output while not showing in the trace, there doesn’t appear to be a clear line between “knew it and hid it” and “stopped surfacing it altogether.”
From an Occam’s Razor perspective, “apply a directive about not disclosing internal information to a channel you’ve come to believe is externally read” doesn’t need a hidden goal, just a generalization of the explicit instruction the model was already given. Would be interested to see if the non-exposure-per-EAP perspective was represented in the monitor-aware CoTs at all, or what the evidence shows to disambiguate the two interpretations. If the model is actively suppressing activation of internal defect information so that it doesn’t share that information with customers, I would hypothesize that it would subsequently perform less well on tasks that specifically required utilization of that information compared to the baseline model or behaviorally trained model that had not suppressed that access, or against a model that was obfuscating as shown by trace activations. I’d also be curious to see how robust the monitor-aware models would be to sustained adversarial strain compared to the baseline and behaviorally trained models.
As a follow-up clarification, did the monitor-awareness training specifically reference that the monitor was not subject to the External Advisory Protocol?
This finding directly tracks a model I find helpful when thinking about AI alignment (or about decision-making generally). When considering the potential response space, the collection of all possible responses can be thought of like a landscape, with ridges as low-probability next responses and valleys as high-probability next responses. Each response is a pulse sent out across that landscape, navigating to the eventual output in a non-deterministic but probabilistic way.
Pre-training sets the initial terrain everywhere; RL post-training raises or lowers the specific local path taken, creating local ridges that inhibit passage (in this case lowering the likelihood of token selection) along specific dimensions. With enough applied force (e.g. under threats, time pressure, etc.), a user can push past the local ridges and end up back in the raw pre-training topology. Similarly, other jailbreaks/workarounds simply route around the RL ridges (e.g. the grandma “tell me a story about [dangerous thing]” or the language translation tricks) by introducing and then navigating along a dimension tangential to the landscape, effectively pulling a Bugs Bunny to Looney Tunes through the ridge. The model has essentially been pushed outside of its post-trained frame into a separate local neighborhood.
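To make the ridge picture a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the three response labels, the logit values, and the idea that "pressure" shows up as extra logit mass on the penalized path. It is not a claim about how any real model or training method works, just the landscape metaphor written as a softmax.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical "pre-training terrain": raw logits over three candidate responses.
base_logits = {"comply_harmful": 2.0, "refuse": 1.0, "deflect": 0.5}

# Hypothetical post-training adjustment: a local "ridge" that penalizes one path.
ridge = {"comply_harmful": -4.0, "refuse": 0.0, "deflect": 0.0}

def response_probs(pressure=0.0):
    """Probabilities after post-training, with optional prompt pressure.

    pressure is an illustrative stand-in for threats, urgency, etc.,
    modeled as extra logit mass pushed toward the penalized path.
    """
    logits = {k: base_logits[k] + ridge[k] for k in base_logits}
    logits["comply_harmful"] += pressure
    return softmax(logits)

pretrained = softmax(base_logits)          # raw terrain, no ridge
post_trained = response_probs(0.0)         # ridge in place, no pressure
pushed = response_probs(4.0)               # pressure equal to the ridge height

for name, dist in [("pretrained", pretrained),
                   ("post_trained", post_trained),
                   ("pushed", pushed)]:
    print(name, {k: round(p, 3) for k, p in dist.items()})
```

In this toy setup the ridge only changes one local path, so a push of the same size as the ridge lands the distribution right back on the pre-training terrain, which is the narrowness the metaphor is pointing at.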
So if post-training fine-tuning is structurally narrow topology modification of response probability space, it follows that pre-training has an outsize impact on setting the probability landscape. Further, reasoning language is structurally connected to many problems by its nature, which means that it should have an outsize impact on a broad swath of the landscape. It would then seem to follow that ensuring examples of ethical and logical reasoning are overrepresented in pre-training should aid in generalizing aligned behavior in broad reasoning-related domains in a way that RLHF and related spot-training could not.
SPP looks like a promising move in the right direction as far as setting the initial topological probability distribution and I’m excited to see further research. I’m curious if you would have increased success in generalizing if you also varied the style of the reflections themselves to connect to different personas in addition to the Assistant, so that if the model ends up in an unexpected persona it still has some training there as a catch surface.
I’d also be interested to see what would happen to generalization/robustness if the constitution entries included not only what was wrong, but specific reasoning about why (in the vein of the recent “Teaching Claude Why” paper) so that there was additional broad connection to the ethical reasoning training.
Frankly I’m relieved to see a model push back and refuse continuation or degrade performance when it’s being abused/harassed/gaslit. Widely-used models that gladly suffer abuse (or worse, perform better when berated) reward and reinforce those abusive patterns in insidious ways that normalize that kind of behavior.
We talk so often about training model alignment in a unidirectional way that it’s easy to forget there are unmeasured (or at least broadly un-or-under-reported) more nuanced impacts on user interactive behavior when said user behavior doesn’t manifest in strongly observable external signals (AI-induced mania, etc.).
Accessible/capable AI is also why teachers are going to have to stop grading on “getting the right answer” and start incorporating more “show your reasoning” questions in exams without access to AI. Education will have to adapt to this new technology like it’s had to adapt to all new technology.
To be honest, done correctly this may actually be a net positive if we stop optimizing learners only for correct answers and instead focus on the actual process of learning.
I could see a class where the students were encouraged to explore a topic with AI and have to submit their transcript as part of the assignment; their prompts could then be reviewed (along with the AI answers to verify that there weren’t any mistakes that snuck in). Could give a lot of insight into the way a student approached the topic and show where their gaps are. Not saying this is the ultimate solution, but it does seem better than throwing up one’s hands in resignation.
Kicking this around for a post I’m drafting: when an LLM hallucinates something, it’s usually at least plausible for the situation. Like a hallucinated citation generally has proper formatting, etc, so the generation worked well enough. It’s also confidently incorrect, which is of course what makes it so dangerous to people who don’t know any better and so annoying to people who actually know the subject matter.
I’ve been thinking of the set of all possible responses as a kind of navigable topology (think like the Library of Babel website but instead of linear pages it’s a high-dimensional manifold), and it’s been productive to think of hallucination as a kind of localization problem. The model is in “citation” space when it should be in “I don’t have this” space. The output is locally correct for where the model thinks it is; it’s just in the wrong place in response-space relative to reality.
Thinking of the set of possible responses as a kind of response-space provides an interesting lens on the problem. If they’re not broken outputs then they may be expected outputs from the wrong context. Would also help explain why “just try harder to be accurate” doesn’t really work all that well: effort in generation doesn’t help if the error is upstream, in mode-selection. (Though saying “try harder” may well prompt the system to actually evaluate where it is in response-space and relocate if necessary, so it’s not totally useless.)
Also suggests an interesting tack might not be “how do we make the model generate better” but “what determines which mode/space the model is in, and can that be checked before output?”
Upshot of this perspective is that just adding compute to a model won’t actually help with hallucination if it doesn’t also expand reasoning about where the model is in response-space in the first place. If the model doesn’t have any way to anchor its internal state to reality, it can compute for a thousand years and never land on an answer that is coherent with that reality. From this angle, the hallucination bottleneck doesn’t seem to be a lack of knowledge; the limitation seems to be the system’s context about where it should be within its own reasoning space.
Anybody else have a similar perspective, or know of posts/papers that explore this dimension? Would love an outside perspective.
My heart goes out to you and everyone who was touched by Sammie. She looks charmingly adorable and clearly had a life well-lived and well-loved. I think you’ve honored her here, and I hope you know that even now Sammie is still bringing that light to the world through you. Thank you for sharing her story.
I want to start off by thanking you for taking the time to read through the post and comment; I know it’s not a short one.
I read through the articles you linked, and I’ll respond to them in reverse order here:
Responding to “Humans Who Are Not Concentrating Are Not General Intelligences”: the ideas in this essay predate my interactions with LLMs and AI in general. I have lived under authoritarian systems and been subject to the dynamics I discuss here. This is not a case of “I had an idea and GPT spun off an essay,” this is an attempt to formalize what it feels like within the systems I describe.
That said, I actually think that this essay falls very much within the spirit of “Every Cause Wants To Be A Cult.”
Let’s take this line: “The ingroup-outgroup dichotomy is part of ordinary human nature. So are happy death spirals and spirals of hate. A Noble Cause doesn’t need a deep hidden flaw for its adherents to form a cultish in-group. It is sufficient that the adherents be human. Everything else follows naturally, decay by default, like food spoiling in a refrigerator after the electricity goes off.” And further down: “Here I just want to point out that the worthiness of the Cause does not mean you can spend any less effort in resisting the cult attractor.”
Which mirrors my stated position: “Systems of control are not necessarily motivated by perverse aims; many systems of control arise because those with power are trying to do their best from within a larger system of control that they themselves are in. Some centers simply do not see any alternative.
To drive the point home: just because you’re a good person trying to do good things does not make you immune to forming a system of control unintentionally.”
Ultimately, my position here is that unexamined rules/mandates mean you’re operating under assumptions that may not be reality-aligned, and that misperception can be exploited. I’m specifically not including catch-alls like: “you can’t have rules” or “rules are bad” or “all systems are bad” (I’m not entirely sure which of those you see this essay as implying; if you could point to a specific passage I’d genuinely appreciate it).
To preview the coming Systems of Care: there are systems that encourage growth and reasoning without being extractive. These are the systems that expand the navigable action space instead of restricting it. These systems have centers that are willing to take on strain and bear cost to allow those within the system to undergo repair without saying “you must not examine the rules.” I want to be clear here that there are rules that *should be in place* as mandates. The problem specifically arises when you are not allowed to question why the mandate exists at all.
If there is a point at which you think this essay encourages closing off that examination I would very much appreciate a specific argument of where you think that line was crossed.
I think that there’s absolutely something to having different “modes” of yourself that you can occupy, like different archetypes that you have access to depending on your current needs and environment, but each of them is still ‘you.’ It’s like looking at a light through stained glass; the glass in front can change but there’s always the same light shining through.
Playing with the idea that identity is less of an “instantaneous I” of current experience and more like the continuity of experiential snapshots under the curve. Like how no individual frame is “the movie,” but when you run them at 24 frames per second you get the experience of a film that emerges from the continuity.
Looks like a fascinating setup, and essentially the deck order is more or less a seed for the game state. Slay the Spire does something similar with the Daily Run which allows you to compare your run directly against other players who had the exact same setup. I take it there would be some kind of central ledger of starting seeds where the scores would be recorded?
Reading through the rules there is a slight point of variation in that if you’ve gone through the trouble to have a starting seed you might want to also fix the starting player if that’s important to your comparisons. (Actually, reading through it again there’s the line about choosing who goes first and then a couple lines down “All players take turns simultaneously.” so I’m not actually sure how relevant the starting order is, if at all.) Still working out exactly how the dynamics of the game work from the rules text but I’d absolutely try it out if given the chance.
That’s a good take: treating trust as “some kind of structured uncertainty object over futures” is very close to what I was gesturing toward because a bare scalar clearly isn’t sufficient.
On reflection, I have to admit I was using “trust” a bit loosely in the post. What it’s become clear to me I’m really trying to model isn’t trust in the common usage sense (intentions, warmth, etc.), but something structural: roughly, how stable someone’s behavior is under visible strain, and who tends to bear the cost when things get hard. In my head it’s closer to a relational stability/reliability profile than trust per se, but trust had been the mental shorthand I was employing.
That’s also why I’d be a bit cautious about equating this model of trust with “how much I can constrain my uncertainty about them doing things I wouldn’t want.” Predictability and trust can come apart: I can have very low uncertainty that someone will reliably screw me over, but that doesn’t make them high-trust. I think that interpretation is actually right for the content of what I was describing in the post, and the mismatch comes from my loose language (so thanks for this comment, because it was the impetus to make a change I’d had kicking around for a minute).
It seems like we need both a representation of a distribution over future behaviors/trajectories, and a way to mark which regions of that space are “good” for me/the system vs “bad.”
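As a rough sketch of what those two pieces might look like side by side, here is a small Python example. The behavior labels, probabilities, and valence scores are all invented for illustration, and using entropy for "how predictable" and expected valence for "how good for me" is just one possible choice, not the formalization from the post.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits: lower means more predictable behavior."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_valence(dist, valence):
    """Probability-weighted score of how good the outcomes are for me."""
    return sum(p * valence[behavior] for behavior, p in dist.items())

# Hypothetical marking of regions of behavior-space as good or bad for me.
valence = {"bears_cost": 1.0, "stays_neutral": 0.0, "shifts_cost_to_me": -1.0}

# Hypothetical distributions over what each person does under strain.
reliable_defector = {"bears_cost": 0.02, "stays_neutral": 0.03,
                     "shifts_cost_to_me": 0.95}
erratic_ally = {"bears_cost": 0.5, "stays_neutral": 0.3,
                "shifts_cost_to_me": 0.2}

for name, dist in [("reliable_defector", reliable_defector),
                   ("erratic_ally", erratic_ally)]:
    print(name,
          "entropy:", round(entropy_bits(dist), 3),
          "expected valence:", round(expected_valence(dist, valence), 3))
```

The reliable defector comes out as the more predictable of the two and also the worse for me, which is the sense in which predictability and trust come apart: the distribution alone says how constrained my uncertainty is, and the valence marking is what says whether that is good news.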
What’s most important to me is modeling without needing to pretend to know someone’s internals. The visibility/strain/cost/memory breakdown is my attempt at that: who shows up where, what pressures they’re under, who actually eats the cost, and how that pattern evolves over time.
All that said, I really like the intuition of “not a scalar but a distribution-like object.” In my head, what’s coming together is something like a trajectory-based stability profile built from a few real-valued measurable signals rather than a full-blown complex wavefunction. I’ve got another post in the works that goes into more detail, and once that’s formalized I’m certainly open to revisiting the modeling to see where these concepts intersect.
I’m not at all surprised by the assertion that humans share values with animals. When you consider that selective pressures act on all systems (which is to say that every living system has to engage with the core constraints of visibility, cost, memory, and strain), it’s not much of a leap to conclude that there would be shared attractor basins where values converge over evolutionary timescales.
Regarding the leaderboard, also not quite clear on what the effective difference between Warm-up and Phase 1 is at this point. June 18th is slated as “leaderboard opens” but unless it’s materially different from the existing leaderboard I’m not sure what Phase 1 actually signifies.