AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Installed five minutes ago. Caught an apparent error I’d previously slightly updated my world model on already.
I expect it to make mistakes and miss things, but it seems performant enough to maybe be useful.
EDIT: I have now seen it make a big mistake. Still seems performant enough to maybe be useful.
EDIT2: I have now seen it make a really dumb mistake I wouldn’t have expected a frontier LLM to make. It claimed this passage
The resulting study was published earlier this month as Estimation and mapping of the missing heritability of human phenotypes, by Wainschtein, Yengo, et al.
was incorrect because
The paper was published online on November 12, 2025 (and listed as an Epub date on PubMed), not “earlier this month” relative to the post date (January 16, 2026)
when in fact the post was published on December 3, 2025.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand Mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
I refrained from upvoting your comment despite agreeing with it.
Relatedly, I think the agreement vote button makes me less likely to upvote low-substance comments I agree with. It’s a convenient outlet for the instinct to make my support known. Posts don’t have an agreement button though.
No, that is not what I am saying. I am saying that the typical reason these sorts of “misgeneralizations” happen is not that there are many parameter configurations of the neural network architecture that all get the same training loss but extrapolate very differently to new data. It’s that some parameter configurations that do not extrapolate to new data in the way the ML engineers want straight up get better loss on the training data than parameter configurations that do.
I don’t think “overfitting” is really the right frame for what’s going on here. This isn’t a problem with neural networks having bad simplicity priors and choosing solutions that are more algorithmically complex than they need to be. Modern neural networks have pretty good simplicity priors. I don’t expect misaligned AIs to have larger effective parameter counts than aligned AIs. The problem isn’t that they overfit; the problem is that the algorithmically simplest fit to the training environment that scores the lowest loss often just doesn’t have the internal properties the ML engineers hoped it would have when they set up that training environment.
We’ve been seeing similar things when pruning graphs of language model computations generated with parameter decomposition. I have a suspicion that something like this might be going on in the recent neuron interpretability work as well, though I haven’t verified that. If you just zero or mean ablate lots of nodes in a very big causal graph, you can get basically any end result you want with very few nodes, because you can select sets of nodes to ablate that are computationally important but cancel each other out in exactly the way you need to get the right answer.[1]
I think the trick is to not do complete ablations, but instead ablate stochastically or even adversarially chosen subsets of nodes/edges:
You select the nodes you want to keep.
The adversary picks which of the nodes you did not choose to keep it wants to zero/mean ablate or not zero/mean ablate, picking subsets that make the loss as high as possible.[2] We do this by optimising masks for the nodes with gradient ascent.
This way, you also don’t need to freeze layer norms to prevent cheating.
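A minimal toy sketch of the cancellation problem and the adversarial fix (the numbers and the exhaustive adversary here are made up for illustration; in practice we optimise mask logits with gradient ascent, as described above):

```python
import itertools

# Toy "computational graph": the model's output is just the sum of node
# contributions. Nodes 1 and 2 are individually important but cancel out.
contributions = [3.0, 5.0, -5.0]   # node 0, node 1, node 2
full_output = sum(contributions)

def output_with_ablation(ablated):
    # Zero-ablate the given set of node indices.
    return sum(c for i, c in enumerate(contributions) if i not in ablated)

kept = {0}                          # circuit claim: "node 0 suffices"
candidates = [i for i in range(len(contributions)) if i not in kept]

# Naive complete ablation: zero every non-kept node at once.
naive_loss = abs(output_with_ablation(set(candidates)) - full_output)

# Adversary: search over subsets of the non-kept nodes for the ablation
# pattern that maximises the loss (exhaustive here because the graph is
# tiny; gradient ascent over mask logits in the real setting).
adv_loss = max(
    abs(output_with_ablation(set(s)) - full_output)
    for r in range(len(candidates) + 1)
    for s in itertools.combinations(candidates, r)
)

print(naive_loss)  # 0.0 -> nodes 1 and 2 cancel, the circuit looks faithful
print(adv_loss)    # 5.0 -> ablating only one of them exposes the reliance
```

The naive complete ablation reports zero damage even though the kept “circuit” ignores two computationally important nodes; the adversarially chosen partial ablation reveals them.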
It’s for a different context, but we talk about the issue with using these sorts of naive ablation schemes to infer causality in Appendix A of the first parameter decomposition paper. This is why we switched to training decompositions with stochastically chosen ablations, and later switched to training them adversarially.
There’s some subtlety to this. You probably want certain restrictions placed on the adversary, because otherwise there’s situations where it can also break faithful circuits by exploiting random noise. We use a scheme where the adversary has to pick one ablation scheme for a whole batch, specifying what nodes it does or does not want to ablate whenever they are not kept, to stop it from fine tuning unstructured noise for particular inputs.
If you don’t hate anything then you don’t love anything either.
This seems false to me. I have made some conscious effort to not feel hateful towards anyone or anything, and did not experience diminished feelings of love as a result of this. If anything, my impression is that it might have made me love more intensely.
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
Unidentifiability: …
Simplicity bias: …
My main reason for expecting misaligned inner objectives isn’t quite captured by either of these. Outside of toy situations, it’s rare in modern ML training for the solution with the lowest loss on the training data to actually be underdetermined in a meaningful sense. Rather, the main issue is that the data is almost always full of tiny systematic effects that we don’t understand or even know about. As a result, the inner objective an ML engineer might imagine would score the lowest loss when they set up their training environment will probably not, in fact, be the inner objective that actually does so. In other words, the problem isn’t that the best-scoring inner objective is genuinely underdetermined in the training loss landscape; it’s that it’s underdetermined to current-day human engineers, with very imperfect knowledge of the data and the training dynamics it induces, who are trying to intuit the answer in advance.
For example, an inner objective shaped around human-like empathy might turn out to make the AI spend an average of 0.03% extra inference steps worrying about whether the human overseers think it is a virtuous member of the tribe while it’s supposed to be solving math problems. That inner objective then loses out to some weird, different objective that’s slightly more compatible with being utterly focused while crunching through ten million calculus problems in a row without any other kind of sensory input.
For a non-fictional current-day example, a lot of RLHF data turned out to reward agreeableness more than sincerity to an extent most ML engineers apparently did not anticipate, leading to a wave of sycophantic models.
This problem gets worse as AI training becomes more dominated by long-form RL environments with a lot of freedom for the AIs to do unexpected stuff, and as the AIs become more creative and agentic. An ML engineer trying to predict in advance which losses and datasets will favor AIs with inner objectives they like over ones they don’t like has a harder and harder time simulating in their head how those AIs might score on the training loss, because it is getting less and less easy to guess what behaviors those objectives would actually lead to.
‘Internally coherent’, ‘explicit’, and ‘stable under reflection’ do not seem to me to be opposed to ‘simple’.
I also don’t think you’d necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It’s not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn’t seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
To be clear, I don’t think human values are particularly simple at all; they’re just not so big that they eat up all spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
I don’t think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model’s current thoughts are judged along some sort of already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don’t describe goodness; they rely on the models’ pre-existing understanding of goodness to work. I’d guess the models also have only one or a very small number of directions for ‘elephant’, but that doesn’t mean ‘elephant’ is a concept you could communicate with a single 16-bit float to an alien who’s never heard of elephants. The ‘feature dimension’ here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
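For concreteness, here is a minimal sketch (all dimensions, names, and numbers invented) of how such a one-dimensional subspace can be read off with a difference-of-means probe, and of why the probe vector only indexes a direction in this particular model’s activation space rather than describing the concept itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # hypothetical model width

# Planted "goodness" direction inside an otherwise random embedding space.
goodness = rng.normal(size=d)
goodness /= np.linalg.norm(goodness)

def embed(valence, n):
    # Activations = isotropic noise + valence * goodness direction.
    return rng.normal(size=(n, d)) + valence * goodness

good_acts, bad_acts = embed(+3.0, 200), embed(-3.0, 200)

# Difference-of-means probe: a single vector, i.e. a 1-D subspace.
probe = good_acts.mean(axis=0) - bad_acts.mean(axis=0)
probe /= np.linalg.norm(probe)

# The probe aligns with the planted axis, but its 64 floats are only
# meaningful relative to this model's representation; handed to a system
# with different activations, they say nothing about goodness.
print(abs(probe @ goodness))  # close to 1.0
```

The probe works only because the representation already encodes the concept; the same point applies to a hypothetical single ‘elephant’ direction.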
Fwiw I think I feel companionate love, to the point of sometimes experiencing a sort of regret for not being able to hug everyone in the universe, and getting emotionally attached to random trees, rocks, frozen peas[1], and old pairs of shoes[2] when I was a kid. And I also recall reading this and thinking: “Screw Green.”
After my mother explained to me that the pea seeds were intended to make new pea plants, I felt guilty for us eating them. For a while I insisted my mother throw a few frozen peas out the window into the tree line every time we cooked with them, because my ca. four year old brain figured that way at least a few of them might have some chance to become new pea plants.
Being ca. four years old, I was growing pretty quickly and got too big for my previous pair of shoes and my parents wanted to throw them away. I felt horrible for betraying the poor friendly shoes like that, so my parents allowed me to keep them on my shelf for a few years until I got old enough to internalise that shoes aren’t people and don’t have qualia.
Sorry, I’m a law dummy: Are these maximum penalties one-offs? As in, can a company just pay $1M in fines whenever they’re caught ignoring CA SB 53 and then go right on ignoring it with no escalating consequences?
Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
I think this is probably wrong. Vanilla SLT describes a toy case of how Bayesian learning on neural networks works. I think there is a big difference between Bayesian learning, which requires visiting every single point in the loss landscape and trying them all out on every data point, and local learning algorithms, such as evolution, stochastic gradient descent, AdamW, etc., which try to find a good solution using information from just a small number of local neighbourhoods in the loss landscape. Those local learning algorithms are the ones I’d expect to be used by real minds, because they’re much more compute efficient.
I think this locality property matters a lot. It introduces additional, important constraints on what nets can feasibly learn. It’s where path dependence in learning comes from. I think vanilla SLT was probably a good tutorial for us before delving into the more realistic and complicated local learning case, but there’s still work to do to get us to an actually roughly accurate model of how nets learn things.
If a solution consists of internal pieces of machinery that need to be arranged exactly right to do anything useful at all, a local algorithm will need something like $e^{\lambda n}$ update steps to learn it, where $n$ is the effective parameter count of the whole solution.[1] In other words, it won’t do better than a random walk that aimlessly wanders around the loss landscape until it runs into a point with low loss by sheer chance. But if a solution with internal pieces of machinery can instead be learned in small chunks that each individually decrease the loss a little bit, the leading term in the number of update steps required to find that solution scales exponentially with the size of the single biggest solution chunk, rather than with the size of the whole solution. So, if the biggest chunk has size $c$, the total learning time will be around $e^{\lambda c}$.[2]
For an example where the solution cannot be learned in chunks like this, see the subset parity learning problem, where SGD really does need a number of update steps exponential in the effective parameter count of the whole solution to learn it, which for most practical purposes means it cannot learn the solution at all.
For a net to learn a big and complicated solution with high Local Learning Coefficient (LLC), it needs a learning story to find the solution’s basin in the loss landscape in a feasible timeframe. It can’t just rely on random walking; that takes too long. The expected total time it takes the net to get to a basin is, I think, determined mostly by the dimensionality of the mode connections from that basin to the rest of the landscape, not just by the dimensionality of the basin itself, as would be the case for the sort of global, Bayesian learning modelled by vanilla SLT. The geometry of those connections is the core mathematical object that reflects the structure of the learning process and determines the learnability of a solution.[3] Learning a big solution chunk that increases the total LLC by a lot in one go means needing to find a very low-dimensional mode connection to traverse. This takes a long time, because the connection interface is very small compared to the size of the search space. To learn a smaller chunk that increases the total LLC by less, the net only needs to reach a higher-dimensional mode connection, which will have an exponentially larger interface that is thus exponentially quicker to find.[4]
I agree that vanilla SLT seems like a useful tool for developing the right mental picture of how nets learn things, but it is not itself that picture. The simplified Bayesian learning case is instructive for illuminating the connection between learning and loss landscape geometry in the most basic setting, but taken on its own it’s still failing to capture a lot of the structure of learning in real minds.
Where $\lambda$ is some constant which probably depends on the details of the update algorithm.
I’m not going to add “I think” and “I suspect” to every sentence in this comment, but you should imagine them being there. I haven’t actually worked this out in math properly or tested it.
At least for a specific dataset and architecture. Modelling changes in the geometry of the loss landscape if we allow dataset and architecture to vary based on the mind’s own decisions as it learns might be yet another complication we’ll need to deal with in the future, once we start thinking about theories of learning for RL agents with enough freedom and intelligence to pick their learning curricula themselves.
To get the rough idea across I’m focusing here on the very basic case where the “chunks” are literal pieces of the final solution and each of them lowers the loss a little and increases the total LLC a little. In general, this doesn’t have to be true though. For example, a solution D with effective parameter count 120 might be learned by first learning independent chunks A and B, each with effective parameter count 50, then learning a chunk C with effective parameter count 30 which connects the formerly independent A and B together into a single mechanistic whole to form solution D. The expected number of update steps in this learning story would be $\sim e^{50\lambda} + e^{50\lambda} + e^{30\lambda} \approx 2e^{50\lambda}$, rather than the $\sim e^{120\lambda}$ it would take to find D in one go.
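Under this toy exponential scaling, and with a made-up value for the rate constant (here called lam, standing in for the constant from the footnote above), the chunked-versus-monolithic arithmetic can be sketched as:

```python
import math

# Hypothetical rate constant; the value 0.1 is invented for illustration.
lam = 0.1

def steps(chunk_size):
    # Toy model: update steps to find a chunk of effective size c scale
    # like exp(lam * c).
    return math.exp(lam * chunk_size)

# Learning solution D (effective parameter count 120) in one monolithic go:
monolithic = steps(120)

# Learning it via independent chunks A (50) and B (50), then connector C (30):
chunked = steps(50) + steps(50) + steps(30)

print(monolithic > chunked)  # True: the chunked learning story is far faster
```

With these numbers the monolithic search takes hundreds of times longer than the chunked one, and the gap widens rapidly as lam or the chunk-size disparity grows.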
This was my favourite solstice to date. Thank you.
I just meant that if an oracle told me ASI was coming in two years, I probably couldn’t spend down energy reserves to get more done within that timeframe compared to being told it’ll take ten years. I might feel a greater sense of urgency than I already do and perhaps end up working longer hours as a result, but if so that’d probably be an unendorsed emotional response I couldn’t help rather than a considered plan. I kind of doubt I’d actually get more done that way. Some slack for curiosity and play is required for me to do my job well.
The stakes are already so high and time so short that varying either within an order of magnitude up or down really doesn’t change things all that much.
I guess figuring out whether we’re “in a bubble” just hasn’t seemed very important to me, relative to how hard it seems to determine? What effects on the strategic calculus do you think it has?
E.g. my current best guess is that I personally should just do what I can to help build the science of interpretability and learning as fast as possible, so we can get to a point where we can start doing proper alignment research and reason more legibly about why alignment might be very hard and what could go wrong. Whether we’re in a bubble or not mostly matters for that only insofar as it’s one factor influencing how much time we have left to do that research.
But I’m already going about as fast as I can anyway, so having a better estimate of timelines isn’t very action-relevant for me. And “bubble vs. no bubble” doesn’t even seem like a leading-order term in timeline uncertainty anyway.
Yeah, the observation that the universe seems maybe well-predicted by a program running on some UTM is a subset of the observation that the universe seems amenable to mathematical description and compression. So the former observation isn’t really an explanation for the latter, just a kind of restatement. We’d need an argument for why a prior over random programs running on a UTM should be preferred over a prior over random strings. Why does the universe have structure? The Universal Prior isn’t an answer to that question. It’s just an attempt to write down a sensible prior that takes the observation that the universe is structured and apparently predictable into account.
See footnote. Since this permutation freedom always exists no matter what the learned algorithm is, it can’t tell us anything about the learned algorithm.
… Wait, are you saying we’re not propagating updates into to change the mass it puts on inputs vs. ?
Yes, I was pointing it out because it seemed like the sort of problem that’d be caused by an issue in the structure of the actual extension rather than the AI model, and might thus be fixable.