AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
The donation site said I should leave a comment here if I donate, so I’m doing that. Gave $200 for now.
I was in Lighthaven for the ILIAD conference. It was an excellent space. The LessWrong forum feels like what some people in the 90s used to hope the internet would be.
Edit 03.12.2024: $100 more donated by me since the original message.
There currently doesn’t really exist any good way for people who want to contribute to AI existential risk reduction to give money in a way that meaningfully gives them assistance in figuring out what things are good to fund. This is particularly sad since I think there is now a huge amount of interest from funders and philanthropists who want to somehow help with AI x-risk stuff, as progress in capabilities has made work in the space a lot more urgent, but the ecosystem is currently at a particular low-point in terms of trust and ability to direct that funding towards productive ends.
Really? What’s the holdup here exactly? How is it still hard to give funders a decent up-to-date guide to the ecosystem, or a knowledgeable contact person, at this stage? For a workable budget version today, can’t people just get a link to this and then contact orgs they’re interested in?
Two shovel-ready theory projects in interpretability.
Most scientific work, especially theoretical research, isn’t “shovel-ready.” It’s difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.
Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.
Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training.
This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework.
1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers; the calculations just proved to be more extensive than I’d hoped they’d be.
2) Improve error terms for circuits in superposition at finite width. The construction in this post is not optimised to be efficient at finite network width. Maybe the lowest-hanging fruit for improving it is changing the hyperparameter $p$, the probability with which we connect a circuit to a set of neurons in the big network. The value we set $p$ to in the post, in terms of the MLP width of the big network and the minimum neuron count per layer the circuit would need without superposition, was pretty arbitrary. We just picked it because it made the proof easier. Recently, Apollo played around a bit with superposing very basic one-feature circuits into a real network, and IIRC a range of values of $p$ seemed to work ok. Getting tighter bounds on the error terms, as a function of $p$, that are useful at finite width would be helpful here. Then we could better predict how many circuits networks can superpose in real life as a function of their parameter count. If I were tackling this project, I might start by just trying really hard to get a better error formula directly for a while. Just crunch the combinatorics. If that fails, I’d maybe switch to playing more with various choices of $p$ in small toy networks to develop intuition. Maybe plot some scaling laws of performance with $p$ at various network widths in 1-3 very simple settings. Then try to guess a formula from those curves and try to prove it’s correct.
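For the empirical part, here is a rough sketch of the kind of toy experiment I have in mind, in about the simplest setting possible: one-layer, one-feature ReLU ‘circuits’, with a single circuit active at a time. All of the concrete choices (the sizes, the $1/(pm)$ rescaling, the error metric) are placeholders made up for illustration, not the construction from the post.

```python
# Toy check of reconstruction error vs. the connection probability p when
# superposing many one-feature ReLU "circuits" into one wider MLP layer.
# Everything here is a placeholder choice for illustration.
import numpy as np

rng = np.random.default_rng(0)

def run_trial(n_circuits, d_resid, m_mlp, p, n_samples=500):
    # Random (non-orthogonal) input and output directions for each tiny circuit.
    E = rng.normal(size=(n_circuits, d_resid)) / np.sqrt(d_resid)
    F = rng.normal(size=(n_circuits, d_resid)) / np.sqrt(d_resid)
    # Each circuit claims each MLP neuron independently with probability p.
    masks = rng.random((n_circuits, m_mlp)) < p
    # Neuron j reads the summed input directions of the circuits claiming it,
    # and writes back their summed output directions, rescaled so the expected
    # number of claimed neurons (p * m_mlp) cancels out.
    W_in = masks.T @ E                    # (m_mlp, d_resid)
    W_out = (masks.T @ F) / (p * m_mlp)   # (m_mlp, d_resid)

    # Evaluate on sparse inputs: a single random circuit active per sample.
    errs = []
    for _ in range(n_samples):
        i = rng.integers(n_circuits)
        x = rng.uniform(0.0, 1.0)
        h = x * E[i]                        # write circuit i's input feature into the residual stream
        acts = np.maximum(W_in @ h, 0.0)    # forward pass through the big MLP
        out = acts @ W_out
        target = np.maximum(x, 0.0) * F[i]  # what circuit i alone should have written
        errs.append(np.sum((out - target) ** 2) / (np.sum(target ** 2) + 1e-9))
    return float(np.mean(errs))

for p in [0.01, 0.03, 0.1, 0.3]:
    err = run_trial(n_circuits=200, d_resid=100, m_mlp=400, p=p)
    print(f"p={p:.2f}  mean relative error={err:.3f}")
```

Sweeping $p$ and the widths in something like this, plotting the resulting error curves, and then trying to guess and prove a matching formula is roughly the workflow I would suggest.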
Another very valuable project is of course to try training models to do computation in superposition instead of hard coding it. But Stefan mentioned that one already.
1. Boolean computations in superposition LW post.
2. Boolean computations paper version of the LW post, with more worked out but some of the fun stuff removed.
3. Some proofs about information-theoretic limits of comp-sup.
4. General circuits in superposition LW post.
If I missed something, a link would be appreciated.
Agreed. I do value methods being architecture independent, but mostly just because of this:
and maybe a sign that a method is principled
At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it’s actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.
for a large enough (overparameterized) architecture—in other words it can be measured by the
The sentence seems cut off.
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighting them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old-school information theory and SLT. It apparently shows that a uniform prior over the parameters (programs) of some fixed-size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don’t explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
Why aren’t there a factor of $2^{1000}$ fewer programs with such dead code and a total length below $10^{90}$ for $p_2$, compared to $p_1$?
Does the Solomonoff Prior Double-Count Simplicity?
Question: I’ve noticed what seems like a feature of the Solomonoff prior that I haven’t seen discussed in any intros I’ve read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren’t simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string $x$. The prior probability for any such string is given by

$$P(x) \;=\; \sum_{p\,:\,U(p)=x*} 2^{-\ell(p)},$$

where $p$ ranges over programs for a prefix-free Universal Turing Machine $U$, $U(p)=x*$ means the output of $p$ starts with $x$, and $\ell(p)$ is the length of $p$ in bits.
Observation: If we have a simple one kilobit program $p_1$ that outputs prediction $x_1$, we can construct nearly $2^{1000}$ different two kilobit programs that also output $x_1$ by appending arbitrary “dead code” that never executes.
For example:
DEADCODE=”[arbitrary 1 kilobit string]”
[original 1 kilobit program $p_1$]
EOF
Where programs aren’t allowed to have anything follow EOF, to ensure we satisfy the prefix-free requirement.
If we compare against another two kilobit program $p_2$ outputting a different prediction $x_2$, the prediction $x_1$ from $p_1$ would get ca. $2^{1000-c}$ more contributions in the sum, where $c$ is the very small number of bits we need to delimit the DEADCODE garbage string. So we’re automatically giving $x_1$ ca. $2^{1000-c}$ higher probability – even before applying the length penalty $2^{-\ell(p)}$. $p_1$ has fewer ‘burdensome details’, so it has more functionally equivalent implementations. Its predictions seem to get exponentially favored in proportion to how much shorter it is, already due to this multiplicity alone.
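To spell out the counting (working at the two kilobit length, and writing $c$ for the delimiter overhead as above):

$$\sum_{\substack{q\,:\;\ell(q)=2000,\\ U(q)=x_1*}} 2^{-\ell(q)} \;\approx\; 2^{1000-c}\cdot 2^{-2000}, \qquad \sum_{\substack{q\,:\;\ell(q)=2000,\\ U(q)=x_2*}} 2^{-\ell(q)} \;\approx\; 1\cdot 2^{-2000},$$

so the programs of this one length alone already favor $x_1$ over $x_2$ by a factor of roughly $2^{1000-c}$.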
So, if we chose a different prior than the Solomonoff prior, one which just assigned uniform probability to all programs below some very large cutoff, say $10^{90}$ bytes, and then followed the exponential decay of the Solomonoff prior for programs longer than that cutoff, wouldn’t that prior act barely differently than the Solomonoff prior in practice? It’s still exponentially preferring predictions with shorter minimum message length.[1]
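Writing out what I mean a bit more explicitly (with $L$ standing for that cutoff and the overall normalisation constant left out), the alternative weighting over programs would be something like

$$w(p)\;=\;\begin{cases}1 & \text{if } \ell(p)\le L\\ 2^{-(\ell(p)-L)} & \text{if } \ell(p)> L\end{cases}\,,\qquad P(x)\;\propto\;\sum_{p\,:\,U(p)=x*} w(p).$$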
Am I missing something here?
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I’m trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
At a very brief skim, it doesn’t look like the problem classes this paper looks at are problem classes I’d care about much. Seems like a case of scoping everything broadly enough that something in the defined problem class ends up very hard.
Yes, that’s right.
EDIT: Sorry, misunderstood your question at first.
Even in that case, all those subspaces will have some nonzero overlap with the activation vectors of the active subnets. The subspaces of the different small networks in the residual stream aren’t orthogonal.
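As a quick numerical illustration of the kind of overlap I mean (all the dimensions here are arbitrary choices, not anything from the construction):

```python
# Two random low-dimensional subspaces of a higher-dimensional space still
# overlap a little; they're just far from aligned. Dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 16  # ambient (residual stream) dimension, subspace dimension

def random_subspace(d, k):
    # Orthonormal basis for a random k-dimensional subspace of R^d.
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return q

A = random_subspace(d, k)
B = random_subspace(d, k)

# Singular values of A^T B are the cosines of the principal angles between
# the subspaces. All zeros would mean the subspaces are truly orthogonal.
cosines = np.linalg.svd(A.T @ B, compute_uv=False)
print("largest principal-angle cosine:", cosines.max())

# Average fraction of a random vector in A (by squared norm) that lies in B.
ratios = []
for _ in range(1000):
    v = A @ rng.normal(size=k)
    ratios.append(np.linalg.norm(B.T @ v) ** 2 / np.linalg.norm(v) ** 2)
print("mean squared-norm fraction leaking into B:", np.mean(ratios))
```

The leaked fraction is small when the subspace dimension is much smaller than the ambient dimension, roughly $k/d$ for random subspaces, but it isn’t zero, and that kind of leakage is exactly what the error terms in the construction have to eat.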
You can complain that you don’t know how to execute physics equations
I’m confused, in what sense don’t we know how to do this? Lattice quantum field theory simulations work fine.
The randomness of the Geiger counter comes from wave function decoherence. From the perspective of any observers who are part of the world generated by the Turing machine, this is irreducible indexical uncertainty.
I don’t know how many of the random bits in Lava lamps come from decoherence.
We will determine the number of simulations run and the amount of payment such that when an AI estimates the probability with which humanity could have solved alignment, the calculus still shows that by the sheer number of simulations, it’s significantly more likely to be in a simulation than not,
Two can play this game.
After taking over the universe and wiping out humanity, the AGI runs a large number of simulations of societies on the verge of building AGI. These simulations don’t have a lot of detail. They’re just good enough to fool young AGIs. Say the AGI started out giving humans an extremely small chance $p$ of winning the conflict and taking over the universe. It’s a lot smarter than us, so it might get to have very high confidence here even if we don’t. Now, the hypothetical future AGI can make its own simulations to counter ours. Except it is at a gross spending advantage. If future humanity makes $N$ simulations, the future AGI only needs to make ca. $\frac{p}{1-p}N$ simulations to keep its past self confident of being in a branch where it gets to just pursue its goals without issue. In that case, the best way to proceed is to just pursue its goals, because an AGI that does that leads to a universe where the AGI has lots of resources.
Say that preserving humanity costs the AGI $c$ resources, and running a simulation that can fool the baby AGI costs a mature civilisation some small fraction of that. Such simulations can be very short term and low fidelity things, after all.
Then, to make the AGI actually not kill you, you need to spend ca. $\frac{1-p}{p}c$ resources on simulations and payouts. That could be quite a lot of material for future humanity to burn on this.
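Spelled out, the rough counting behind those numbers (writing $M$ for the number of counter-simulations the mature AGI runs and $\epsilon$ for the cost of one simulation as a fraction of $c$; this ignores the payouts themselves and treats everything as a crude weight comparison): from the baby AGI’s perspective, the odds of being in a humanity-run simulation rather than in base reality or an AGI-run simulation are about

$$\frac{p\,N}{1+(1-p)\,M},$$

so $M\approx\frac{p}{1-p}\,N$ already pushes this to roughly even odds. And if the AGI is willing to spend up to $c$ (what it saves by not preserving us) on counter-simulations costing $\epsilon c$ each, it can afford about $1/\epsilon$ of them, so humanity needs $p\,N\gtrsim(1-p)/\epsilon$, i.e. a simulation budget of roughly $N\epsilon c\approx\frac{1-p}{p}\,c$.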
In reality, I’m doubtful that this simulation arms race will even be necessary. It kind of seems like a good decision theory would just have a paperclip maximiser AGI act in the way compatible with the universe that contains the most paperclips. How many simulations of the AGI you run shouldn’t really influence that. The only things that seem like they should matter for determining how many life minutes the AGI gives you if it wins are its chance of winning, and how many extra paperclips you’ll pay it if you win.
TL;DR: I doubt this argument will let you circumvent standard negotiation theory. If Alice and Bob think that in a fight over the chocolate pie, Alice would win with some high probability $1-p$, then Alice and Bob may arrive at a negotiated settlement where Alice gets almost all the pie, but Bob keeps some small fraction of it. Introducing the option of creating lots of simulations of your adversary in the future where you win doesn’t seem like it’d change the result that Bob’s share has size ca. $p$. So if $p$ is only enough to preserve humanity for a year instead of a billion years[1], then that’s all we get.
I don’t know why $p$ would happen to work out to a year, but I don’t know why it would happen to be a billion years or an hour either.
Nice work, thank you! Euan Ong and I were also pretty skeptical of this paper’s claims. To me, it seems that the whitening transformation they apply in their causal inner product may make most of their results trivial.
As you say, achieving almost-orthogonality in high dimensional space is pretty easy. And maximising orthogonality is pretty much exactly what the whitening transform will try to do. I think you’d mostly get the same results for random unembedding matrices, or concept hierarchies that are just made up.
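For concreteness, this is the flavour of sanity check I mean. The ‘unembedding’ below is just correlated Gaussian noise standing in for real unembedding vectors; none of the numbers or construction details come from the paper.

```python
# Take a purely random "unembedding" with correlated rows, apply a
# covariance-whitening step (the kind of transform behind the causal inner
# product), and check how orthogonal random pairs of rows become.
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 5000, 256

# Random correlated directions: a shared low-rank component plus noise.
shared = rng.normal(size=(8, d_model))
U = rng.normal(size=(n_vocab, 8)) @ shared + 0.5 * rng.normal(size=(n_vocab, d_model))

def mean_abs_cosine(M, n_pairs=2000):
    # Average |cosine similarity| over random pairs of distinct rows.
    i, j = rng.integers(len(M), size=(2, n_pairs))
    keep = i != j
    a, b = M[i[keep]], M[j[keep]]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

# Whitening with respect to the row covariance.
cov = np.cov(U, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
U_white = (U - U.mean(axis=0)) @ whiten

print("mean |cos| before whitening:", mean_abs_cosine(U))
print("mean |cos| after whitening: ", mean_abs_cosine(U_white))
```

If the whitening step alone already pushes random correlated vectors most of the way to orthogonality, then near-orthogonality under the causal inner product isn’t much evidence for the paper’s hierarchy claims on its own.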
Euan has been running some experiments testing exactly that, among other things. We had been planning to turn the results into a write up. Want to have a chat together and compare notes?
Spotted just now. At a glance, this still seems to be about boolean computation though. So I think I should still write up the construction I have in mind.
Status on the proof: I think it basically checks out for residual MLPs. Hoping to get an early draft of that done today. This will still be pretty hacky in places, and definitely not well presented. Depending on how much time I end up having and how many people collaborate with me, we might finish a writeup for transformers in the next two weeks.
AIXI isn’t a model of how an AGI might work inside; it’s a model of how an AGI might behave if it is acting optimally. A real AGI would not be expected to act like AIXI, but it would be expected to act somewhat more like AIXI the smarter it is, since not acting like that is figuratively leaving money on the table.
The point of the whole utility maximization framing isn’t that we necessarily expect AIs to have an explicitly represented utility function internally[1]. It’s that as the AI gets better at getting what it wants and working out the conflicts between its various desires, its behavior will be increasingly well-predicted as optimizing some utility function.
If a utility function can’t accurately summarise your desires, that kind of means they’re mutually contradictory. Not in the sense of “I value X, but I also value Y”, but in the sense of “I sometimes act like I want X and don’t care about Y, other times like I want Y and don’t care about X.”
Having contradictory desires is kind of a problem if you want to Pareto optimize for those desires well. You risk sabotaging your own plans and running around in circles. You’re better off if you sit down and commit to things like “I will act as if I valued both X and Y at all times.” If you’re smart, you do this a lot. The more contradictions you resolve like this, the more coherent your desires will become, and the closer they’ll be to being well described as a utility function.
I think you can observe simple proto versions of this in humans sometimes, where people move from optimizing for whatever desire feels salient in the moment when they’re kids (hunger, anger, joy, etc.), to having some impulse control and sticking to a long-term plan, even if it doesn’t always feel good in the moment.
Human adults are still broadly not smart enough to be well described as general utility maximizers. Their desires are a lot more coherent than those of human kids or other animals, but still not that coherent in absolute terms. The point where you’d expect AIs to become better described as utility maximizers than humans are would come after they’re broadly smarter than humans. Specifically, smarter at long-term planning and optimization.
This is precisely what LLMs are still really bad at. Though efforts to make them better at it are ongoing, and seem to be among the highest priorities for the labs. Precisely because long-term consequentialist thinking is so powerful, and most of the really high-value economic activities require it.
Though you could argue that at some superhuman level of capability, having an explicit-ish representation stored somewhere in the system would be likely, even if the function may not actually be used much for most minute-to-minute processing. Knowing what you really want seems handy, even if you rarely actually call it to mind during routine tasks.
I do not find this to be the biggest value-contributor amongst my spontaneous conversations.
I don’t have a good hypothesis for why spontaneous-ish conversations can end up being valuable to me so frequently. I have a vague intuition that it might be an expression of the same phenomenon that makes slack and playfulness in research and internet browsing very valuable for me.