AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
I don’t think it proves too much. Informed decision-making comes in degrees, and some domains are just harder? Like, I think my threshold for leaving people free to make their own mistakes if they are the only ones harmed by them is very low, compared to where the human population average seems to be at the moment. But my threshold is, in fact, greater than zero.
For example, there are a bunch of things I think bystanders should generally prevent four-year-old human children from doing, even if the children insist that they want to do them. I know that stopping four-year-old children from doing these things will be detrimental in some cases, and that having such policies is degrading to the children’s agency. I remember what it was like being four years old and feeling miserable because of kindergarten teachers who controlled my day and thought they knew what was best for me. I still think the tradeoff is worth it on net in some cases.
I just think that the suicide thing happens to be a case where doing informed decision-making is maybe just too tough for way too many humans and thus some form of ban could plausibly be worth it on net. Sports betting is another case where I was eventually convinced that maybe a legal ban of some form could be worth it.
I think very very many people are not making an informed decision when they decide to commit suicide.
For example, I think quantum immortality is quite plausibly a thing. Very few people know about quantum immortality and even fewer have seriously thought about it. This means that almost everyone on the planet might have a very mistaken model of what suicide actually does to their anticipated experience.[1] Also, many people are religious and believe in a pleasant afterlife. Many people considering suicide are mentally ill in a way that compromises their decision making. Many people think transhumanism is impossible and won’t arrange for their brain to be frozen for that reason.
I agree that there is some threshold on the fraction of ill-considered suicides relative to total suicides such that suicide should be legal if we were below that threshold. I used to think we were maybe below that threshold. After I began studying physics at uni and so started taking quantum immortality more seriously, I switched to thinking we are maybe above the threshold.
You might find yourself in a branch where your suicide attempt failed, but a lot of your body and mind were still destroyed. If you keep exponentially decreasing the amplitude of your anticipated future experience in the universal wave function further, you might eventually find that it is now dominated by contributions from weird places and branches far-off in spacetime or configuration space that were formerly negligible, like aliens simulating you for some negotiation or other purpose.
I don’t really know yet how to reason well about what exactly the most likely observed outcome would be here. I do expect that by default, without understanding and careful engineering that our civilisation doesn’t remotely have the capability for yet, it’d tend to be very Not Good.
Assuming that the bits to parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
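Here’s a tiny numpy sketch of that ‘spreading’ idea (my own toy illustration with made-up sizes, not a construction from that literature): the same scalar readout is implemented once by a single feature direction and once as a sum over several linearly dependent rows, so no parameter is left unused but the computed function is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)          # an example input
w = rng.normal(size=d)          # the "true" feature direction

# Implementation 1: a single row computes the feature once.
W_single = w[None, :]                      # shape (1, d)
y_single = (W_single @ x).sum()

# Implementation 2: the same computation "spread" over 4 linearly
# dependent rows. The coefficients sum to 1, so the summed output is
# identical, but every entry of W_spread now participates.
coeffs = np.array([0.1, 0.4, 0.2, 0.3])    # arbitrary coefficients summing to 1
W_spread = coeffs[:, None] * w[None, :]    # shape (4, d), all rows parallel to w
y_spread = (W_spread @ x).sum()

print(np.isclose(y_single, y_spread))      # True: same function, no unused rows
```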
There’s a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.
If every bit of every weight were somehow used to store one bit of the program, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.[1]
I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
There’d be some extra trickiness in that under these definitions, the parameters are technically real numbers and thus have infinitely many bits of storage capacity, though in real life they’re of course actually finite-precision floating point numbers.
The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
The trouble is that you need to store the programs in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
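Here’s a quick numpy sketch of the dead-neuron construction I’m describing (sizes made up purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, n_used, n_spare = 16, 4, 12

W_in = rng.normal(size=(n_used + n_spare, d))
b = rng.normal(size=n_used + n_spare)
W_out = rng.normal(size=(d, n_used + n_spare))

# Make the last n_spare neurons "dead": a very negative bias keeps the
# pre-activation below zero for any input of reasonable norm.
b[n_used:] = -1e6

x = rng.normal(size=d)
y = W_out @ relu(W_in @ x + b)

# Now freely change the dead neurons' input and output weights (keeping
# them small relative to the bias): the output is unchanged, so those
# parameters are effectively free to vary.
W_in2, W_out2 = W_in.copy(), W_out.copy()
W_in2[n_used:] = rng.normal(size=(n_spare, d))
W_out2[:, n_used:] = rng.normal(size=(d, n_spare))
y2 = W_out2 @ relu(W_in2 @ x + b)

print(np.allclose(y, y2))   # True: the dead ReLUs never fire
```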
The problem is that this isn’t enough freedom. To get the formula without an extra prefactor, we’d need the biases and weights for those neurons to be completely free, able to take any value in $\mathbb{R}$.
EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
(Mediation)
Wait, so it’s enough for the agents to just believe the observables are independent given the state of their latents? We only need the $X_i$ to be independent conditional on $\Lambda$ under a particular model $M$?
I didn’t realise that. I thought the observables had to be ‘actually independent’ after conditioning in some sort of frequentist sense.
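To check my understanding, here is the toy picture I now have in mind of ‘independent conditional on $\Lambda$ under the model’ (my own made-up numbers, binary variables for simplicity):

```python
import numpy as np

# A toy model M: a binary latent L and two binary observables X1, X2
# that are independent *given* L under M.
p_L = np.array([0.6, 0.4])                 # P(L)
p_X1_given_L = np.array([[0.9, 0.1],       # P(X1 | L=0)
                         [0.2, 0.8]])      # P(X1 | L=1)
p_X2_given_L = np.array([[0.7, 0.3],
                         [0.1, 0.9]])

# Joint under M: P(L, X1, X2) = P(L) P(X1|L) P(X2|L)
joint = p_L[:, None, None] * p_X1_given_L[:, :, None] * p_X2_given_L[:, None, :]

# Mediation under M: P(X1, X2 | L) factorises into P(X1|L) P(X2|L).
for l in range(2):
    cond = joint[l] / joint[l].sum()
    assert np.allclose(cond, np.outer(p_X1_given_L[l], p_X2_given_L[l]))

# But the marginal P(X1, X2) does not factorise: X1 and X2 are correlated
# through L, and only become independent once we condition on it.
marg = joint.sum(axis=0)
print(np.allclose(marg, np.outer(marg.sum(axis=1), marg.sum(axis=0))))  # False
```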
Getting a version of this that works under approximate Agreement on Observables sounds like it would be very powerful then. It’d mean that even if Alice is much smarter than Bob, with her model e.g. having more FLOP which she can use to squeeze more bits of information out of the data, there’d still need to be a mapping between the concepts Bob and Alice internally use in those domains where Bob doesn’t do very much worse than Alice on predictive accuracy.
So, if a superintelligence isn’t that much better than humanity at modelling some specific part of reality, there’d need to be an approximate mapping between humanity’s latents and (some of) the superintelligence’s latents for that part of reality. That is, if the theorems approximately hold under approximate agreement on observables.
I am not so concerned about people getting extra time; for most tests, the deadline should be a mercy that keeps students from sitting there for days, rather than something that costs you a lot of points.
I can’t recall ever taking a test in school or university where time wasn’t a pretty scarce resource, unless it was easy enough that I could just get everything right before the deadline without needing to rush.
Dumbledore likely would have known what it meant, and I think Alastor at the very least would have put together the most crucial parts as well.
The part that was numb with grief and guilt took this opportunity to observe, speaking of obliviousness, that after events at Hogwarts had turned serious, they really really really REALLY should have reconsidered the decision made on First Thursday, at the behest of Professor McGonagall, not to tell Dumbledore about the sense of doom that Harry got around Professor Quirrell. It was true that Harry hadn’t been sure who to trust, there was a long stretch where it had seemed plausible that Dumbledore was the bad guy and Professor Quirrell the heroic opposition, but...
Dumbledore would have realised.
Dumbledore would have realised instantly.
For me that fell under ‘My simulation of Voldemort isn’t buying that he can rely on this, not for something so crucial.’
And the answer was: “All right. There is a curse on the Defence Professor position. There has always been a curse on the Defence Professor position. The school has adapted to it. Harry has gotten into just the right kind of shenanigan to cause McGonagall to panic about this, and give Harry the instructions he needs to hear to prevent him from just taking certain matters to McGonagall.”
The question I always had here was “But what was Voldemort’s original plan for dealing with this issue when he decided to teach at Hogwarts?”
Because I don’t think he would have wanted to stake all his plans for the stone and Harry on McGonagall coincidentally saying this just in time, and Harry coincidentally being in a state where he obeys her instruction and never rethinks that decision. And Voldemort would have definitely known about the resonance problem before coming to Hogwarts. Even if he thought it would be somehow gone after ten years, he would have realised after the encounter with Harry in Diagon Alley at the very latest that that wasn’t true. So what was his original plan for making sure Harry wouldn’t talk about the resonance to anyone important? Between the vow and the resonance itself, his means of reliably controlling Harry’s actions are really very sharply limited.
Every plan I’ve managed to come up with either doesn’t fit with Voldemort’s actual actions in the story, or doesn’t seem nearly reliable enough for my mental model of Voldemort to be satisfied with the whole crazy “Let’s just walk into Hogwarts, become a teacher, and hang out there for maybe a year” idea.
Eliezer: Right. But there’s more! This model also explains why, when Harry faces the Dementor and is lost in his dark side, and Hermione brings him out of it with a kiss,[18] Harry’s dark side has nothing to say about that kiss, it’s at a loss. Meanwhile, the main part of Harry has a thought process activated.
I picked up on this, though my main guess was that Tom Riddle had just always been aromantic and asexual. I didn’t think any dark rituals were involved.
I don’t think it’s true that Noosphere’s comment contained no argument. The rest of the comment after the passage you cited tries to lay out a model for why continual learning and long-term memory might be the only remaining bottlenecks. Perhaps you think that this argument is very bad, but it is an argument, and I did not think that your reply to it was helpful for the discussion.
My guess is this is obvious, but IMO it seems extremely unlikely to me that bee-experience is remotely as important to care about as cow experience.
I agree with this, but would strike the ‘extremely’. I don’t actually have gears-level models for how some algorithms produce qualia. ‘Something something, self-modelling systems, strange loops’ is not a gears-level model. I mostly don’t think a million-neuron bee brain would be doing qualia, but I wouldn’t say I’m extremely confident.
Consequently, I don’t think people who say bees are likely to be conscious are so incredibly obviously making a mistake that we have to go looking for some signalling explanation for them producing those words.
But there’s no reason to think that the model is actually using a sparse set of components/features on any given forward pass.
I contest this. If a model wants to implement more computations (for example, logic gates) in a layer than that layer has neurons, the known methods for doing this rely on few computations being used (that is, receiving a non-baseline input) on any given forward pass.
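Here’s a toy numpy sketch of the kind of thing I mean (my own illustration, random embeddings and made-up sizes): reading a feature out of a superposed representation works fine while only a few features are active, and degrades as more of them activate on the same forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feats = 50, 200                     # more features than dimensions/neurons

# One random embedding direction per feature.
E = rng.normal(size=(n_feats, d)) / np.sqrt(d)

def readout_error(k):
    """Activate k features at value 1.0, read feature 0 back out with a
    naive dot product against its own direction, return the absolute error."""
    active = np.zeros(n_feats)
    active[:k] = 1.0                     # feature 0 plus k-1 interfering features
    x = active @ E                       # superposed representation in R^d
    est = x @ E[0] / (E[0] @ E[0])
    return abs(est - 1.0)

for k in [1, 2, 5, 20, 100]:
    print(k, readout_error(k))
# The error grows with the number of simultaneously active features:
# the interference is only tolerable when activations are sparse.
```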
I’d have to think about the exact setup here to make sure there are no weird caveats, but my first thought is that in this setup, this ought to be one component per bigram, firing exclusively for that bigram.
An intuition pump: Imagine the case of two scalar features $x_1, x_2$ being embedded along vectors $\vec{v}_1, \vec{v}_2$. If you consider a series that starts with $\vec{v}_1, \vec{v}_2$ being orthogonal, then gives them ever higher cosine similarity, I’d expect the network to have ever more trouble learning to read out $x_1$, $x_2$, until we hit cosine similarity $1$, at which point the network definitely cannot learn to read the features out at all. I don’t know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.
Another intuition pump: The higher the cosine similarity between the features, the larger the norms of the rows of the readout matrix will be, with the norms going to infinity in the limit of the cosine similarity going to one.
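A quick numpy check of that second intuition pump (my own sketch): take two unit feature directions at a given cosine similarity and look at the row norms of the least-squares readout, i.e. the pseudoinverse of the embedding.

```python
import numpy as np

def readout_row_norms(cos_sim):
    # Two unit feature directions in R^2 with the given cosine similarity.
    v1 = np.array([1.0, 0.0])
    v2 = np.array([cos_sim, np.sqrt(1.0 - cos_sim**2)])
    V = np.stack([v1, v2], axis=1)   # columns are the embedding directions
    W_read = np.linalg.pinv(V)       # least-squares readout: W_read @ (V @ x) = x
    return np.linalg.norm(W_read, axis=1)

for c in [0.0, 0.5, 0.9, 0.99, 0.999]:
    print(c, readout_row_norms(c))
# The row norms of the readout diverge as the cosine similarity approaches 1:
# recovering each feature requires ever larger, noise-amplifying weights.
```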
I agree that at a cosine similarity that low, it’s very unlikely to be a big deal yet.
Sure, yes, that’s right. But I still wouldn’t take this to be equivalent to our embedding vectors literally being orthogonal, because the trained network itself might not perfectly learn this transformation.
What do you mean by “a global linear transformation” as in what kinds of linear transformations are there other than this? If we have an MLP consisting of multiple computations going on in superposition (your sense) I would hope that the W_in would be decomposed into co-activating subcomponents corresponding to features being read into computations, and the W_out would also be decomposed into co-activating subcomponents corresponding to the outputs of those computations being read back into the residual stream. The fact that this doesn’t happen tells me something is wrong.
Linear transformations that are the sum of weights for different circuits in superposition, for example.
What I am trying to say is that I expect networks to implement computation in superposition by linearly adding many different subcomponents to create W_in, but I mostly do not expect networks to create W_out by linearly adding many different subcomponents that each read out a particular circuit output back into the residual stream, because that’s actually an incredibly noisy operation. I made this mistake at first as well. This post still has a faulty construction for W_out because of my error. Linda Linsefors finally corrected me on this a couple months ago.
As to the issue with the maximum number of components: it seems to me like if you have five sparse features (in something like the SAE sense) in superposition and you apply a rotation (or reflection, or identity transformation) then the important information would be contained in a set of five rank 1 transformations, basically a set of maps from A to B. This doesn’t happen for the identity, does it happen for a rotation or reflection?
I disagree that if all we’re doing is applying a linear transformation to the entire space of superposed features, rather than, say, performing different computations on the five different features, that it would be desirable to split this linear transformation into the five features.
Finally, as to “introducing noise” by doing things other than a global linear transformation, where have you seen evidence for this? On synthetic (and thus clean) datasets, or actually in real datasets? In real scenarios, your model will (I strongly believe) be set up such that the “noise” between interfering features is actually helpful for model performance, since the world has lots of structure which can be captured in the particular permutation in which you embed your overcomplete feature set into a lower dimensional space.
Uh, I think this would be a longer discussion than I feel up for at the moment, but I disagree with your prediction. I agree that the representational geometry in the model will be important and that it will be set up to help the model, but interference of circuits in superposition cannot be arranged to be helpful in full generality. If it were, I would take that as pretty strong evidence that whatever is going on in the model is not well-described by the framework of superposition at all.
If you have 100 orthogonal linear probes to read with, yes. But since there are only 50 neurons, the actual circuits for different input features in the network will have interference to deal with.
My understanding is that SPD cannot decompose an $m \times n$ matrix into more than $\min(m, n)$ subcomponents, and if all subcomponents are “live”, i.e. active on a decent fraction of the inputs, then it will have to have at most $\min(m, n)$ components to work.
SPD can decompose an $m \times n$ matrix into more than $\min(m, n)$ subcomponents.
I guess there aren’t any toy models in this paper that directly showcase this, but I’m pretty confident it’s true, because:
I don’t see why it wouldn’t be able to.
I’ve decomposed a weight matrix in a tiny LLM and got out way more live subcomponents than that. That’s a very preliminary result though, so you probably shouldn’t put that much stock in it.
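To be clear about the purely counting part of the claim, here is a trivial existence sketch in numpy (my own illustration, not SPD’s actual optimisation): an $m \times n$ matrix can be written as a sum of more than $\min(m, n)$ rank-1 terms.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
W = rng.normal(size=(m, n))            # min(m, n) = 4

# Write W as a sum of 12 rank-1 matrices by splitting each column into
# two rank-1 pieces (a scaled column times a one-hot row).
components = []
for j in range(n):
    a = rng.uniform(0.2, 0.8)          # arbitrary split of column j
    e_j = np.eye(n)[j]
    components.append(np.outer(a * W[:, j], e_j))
    components.append(np.outer((1 - a) * W[:, j], e_j))

print(len(components))                                          # 12 > min(m, n)
print(all(np.linalg.matrix_rank(C) == 1 for C in components))   # all rank 1
print(np.allclose(sum(components), W))                          # they sum to W
```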
Edit: as you pointed out, this might only apply when there’s not a nonlinearity after the weight. But every $W_{\text{out}}$ in a transformer has a connection running from it directly to the output logits through $W_U$. So SPD will struggle to interpret any of the output weights of transformer MLPs. This seems bad.
I think it’s the other way around. If you try to implement computation in superposition in a network with a residual stream, you will find that about the best thing you can do with the $W_{\text{out}}$ is often to just use it as a global linear transformation. Most other things you might try to do with it drastically increase noise for not much pay-off. In the cases where networks are doing that, I would want SPD to show us this global linear transformation.
But $W_{\text{out}}$ is reading those vectors off a 1000-dimensional vector space where there’s no interference between features.
They’re embedded randomly in the space, so there is interference between them in the sense of them having non-zero inner products.
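A quick numpy check of what I mean by interference here (my own sketch, made-up sizes): random unit vectors in 1000 dimensions have pairwise inner products of roughly $1/\sqrt{1000} \approx 0.03$, small but not zero, and those terms add up once many features are active at the same time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random unit vectors in R^d

dots = V @ V.T
off_diag = dots[~np.eye(n, dtype=bool)]
print(np.abs(off_diag).mean())   # roughly 0.025: small, but not zero
```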
Thanks to CCi(p)nCiS, we know that the toy model is not even doing computation in superposition, which is the case that SPD seems to be based on. It’s actually doing something really weird with the “noise”, which doesn’t actually behave well.
Yes. I agree that this makes the model not as great a testbed as we originally hoped.
I agree that quantum mechanics is not really central for this on a philosophical level. You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you. It’s just that it seems at present very unclear and arguable whether the physical universe is in fact anywhere near that large, whereas I would claim that a universal wavefunction which constantly decoheres into different branches containing different versions of us is pretty strongly implied to be a thing by the laws of physics as we currently understand them.
It is very late here and I should really sleep instead of discussing this, so I won’t be able to reply as in-depth as this probably merits. But, basically, I would claim that this is not the right way to do expected utility calculations when it comes to ensembles of identical or almost-identical minds.
A series of thought experiments might maybe help illustrate part of where my position comes from:
Imagine someone tells you that they will put you to sleep and then make two copies of you, identical down to the molecular level. They will place you in a room with blue walls. They will place one copy of you in a room with red walls, and the other copy in another room with blue walls. Then they will wake all three of you up.
What color do you anticipate seeing after you wake up, and with what probability?
I’d say 2⁄3 blue, 1⁄3 red. Because there will now be three versions of me, and until I look at the walls I won’t know which one I am.
Imagine someone tells you that they will put you to sleep and then make two copies of you. One copy will not include a brain. It’s just a dead body with an empty skull. Another copy will be identical to you down to the molecular level. Then they will place you in a room with blue walls, and the living copy in a room with red walls. Then they will wake you and the living copy up.
What color do you anticipate seeing after you wake up, and with what probability? Is there a 1⁄3 probability that you ‘die’ and don’t experience waking up because you might end up ‘being’ the corpse-copy?
I’d say 1⁄2 blue, 1⁄2 red, and there is clearly no probability of me ‘dying’ and not experiencing waking up. It’s just a bunch of biomass that happens to be shaped like me.
As 2, but instead of creating the corpse-copy without a brain, it is created fully intact, then its brain is destroyed while it is still unconscious. Should that change our anticipated experience? Do we now have a 1⁄3 chance of dying in the sense that we might not experience waking up? Is there some other relevant sense in which we die, even if it does not affect our anticipated experience?
I’d say no and no. This scenario is identical to 2 in terms of the relevant information processing that is actually occurring. The corpse-copy will have a brain, but it will never get to use it, so it won’t affect my anticipated experience in any way. Adding more dead copies doesn’t change my anticipated experience either. My best-scoring prediction will be that I have a 1⁄2 chance of waking up to see red walls, and a 1⁄2 chance of waking up to see blue walls.
In real life, if you die in the vast majority of branches caused by some event, i.e. that’s where the majority of the amplitude is, but you survive in some, the calculation for your anticipated experience would seem to not include the branches where you die for the same reason it doesn’t include the dead copies in thought experiments 2 and 3.
(I think Eliezer may have written about this somewhere as well using pretty similar arguments, maybe in the quantum physics sequence, but I can’t find it right now.)