AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
I agree that this seems maybe useful for some things, but not for the “Which UTM?” question in the context of debates about Solomonoff induction specifically, and I think that’s the “Which UTM?” question we are actually kind of philosophically confused about. I don’t think we are philosophically confused about which UTM to use in the context of us already knowing some physics and wanting to incorporate that knowledge into the UTM pick, we’re confused about how to pick if we don’t have any information at all yet.
Attempted abstraction and generalization: If we don’t know what the ideal UTM is, we can start with some arbitrary UTM and use it to predict the world for a while. After (we think) we’ve gotten most of our prediction mistakes out of the way, we can then look at our current posterior and ask which other UTM might have updated to that posterior faster, using fewer bits of observation about (our universe/the string we’re predicting). You could think of this as a way to define what the ‘correct’ UTM is. But I don’t find that definition very satisfying, because the validity of this procedure for finding a good UTM depends on how correct the posterior we’ve converged on with our previous, arbitrary UTM is. ‘The best UTM is the one that figures out the right answer the fastest’ is true, but not very useful.
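Writing out the definition I’m gesturing at a bit more explicitly (my own ad-hoc notation, not taken from any source): let $P^*$ be the posterior our arbitrary starting UTM has converged to after seeing the data, and let $P_U(\cdot \mid x_{1:k})$ be the Solomonoff posterior with reference machine $U$ after the first $k$ observations. The ‘best’ UTM in this sense would then be something like
$$U^{\text{best}} \;=\; \operatorname*{arg\,min}_{U}\; \min\big\{\,k \;:\; D_{\mathrm{KL}}\big(P^* \,\big\|\, P_U(\cdot \mid x_{1:k})\big) \le \varepsilon \,\big\},$$
i.e. the machine that needs the fewest observations to get (approximately) to the posterior we already trust. Which is exactly where the circularity comes in, since that posterior was itself produced by the arbitrary starting machine.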
Is the thermodynamics angle gaining us any more than that for defining the ‘correct’ choice of UTM?
We used some general reasoning procedures to figure out some laws of physics and stuff about our universe. Now we’re basically asking what other general reasoning procedures might figure out stuff about our universe as fast or faster, conditional on our current understanding of our universe being correct.
Why does it make Bayesian model comparison harder? Wouldn’t you get explicit predicted probabilities for the data from any two models you train this way? I guess you do need to sample from the Gaussian over the latent a few times for each data point and pass the results through the flow models, but that shouldn’t be too expensive.
Did that clarify?
Yes. Seems like a pretty strong assumption to me.
Yup, it sure does look similar. One tricky point here is that we’re trying to fit the distributions to the data, so if going that route we’d need to pick some parametric form for them.
Ah. In that case, are you sure you actually need to do the model comparisons you want? Do you even really need to work with this specific functional form at all? As opposed to e.g. training a model to feed its output into tiny normalizing flow models, which then try to reconstruct the original input data with conditional probability distributions?
To sketch out a little more what I mean: the latent could e.g. be constructed as a parametrised function[1] which takes in the actual samples and returns the mean of a Gaussian, which is then sampled from in turn[2]. The conditional distributions of the observables would be constructed using normalising flow networks[3], which take in the sampled latent as well as uniform distributions over variables that have the same dimensionality as the corresponding observables. Since the networks are efficiently invertible, this gives you explicit representations of the conditional probabilities of the observables given the latent, which you can then fit to the actual data using KL-divergence.
You’d get explicit representations for both the latent’s distribution given the data and the observables’ conditional distributions given the latent from this. (Rough code sketch below, after the footnotes.)
Or an ensemble of functions, if you want the mean of the Gaussian to take some specific form.
Using reparameterization to keep the sampling operation differentiable in the mean.
If the dictionary of possible values the observables can take is small, you can also just use a more conventional ML setup which explicitly outputs probabilities for every possible value of every observable, of course.
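In code, a minimal sketch of the kind of setup I mean (PyTorch; all class and function names here are made up for illustration, I’m using a Gaussian base distribution for the flow instead of the uniform one mentioned above purely for brevity, and a real version would stack several coupling layers rather than the single affine layer shown):

```python
import torch
import torch.nn as nn

class GaussianLatentEncoder(nn.Module):
    """Maps the observed samples to the mean of a Gaussian over the latent,
    then samples from it with the reparameterization trick (footnote [2])."""
    def __init__(self, x_dim, latent_dim, sigma=1.0):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.sigma = sigma

    def forward(self, x):
        mu = self.mean_net(x)
        z = mu + self.sigma * torch.randn_like(mu)  # differentiable w.r.t. mu
        return z, mu

class ConditionalAffineFlow(nn.Module):
    """One affine flow step per observable: x_i = exp(s(z)) * u + t(z).
    Invertible in closed form, so log p(x_i | z) is explicit."""
    def __init__(self, xi_dim, latent_dim):
        super().__init__()
        self.s_net = nn.Linear(latent_dim, xi_dim)
        self.t_net = nn.Linear(latent_dim, xi_dim)

    def log_prob(self, xi, z):
        s, t = self.s_net(z), self.t_net(z)
        u = (xi - t) * torch.exp(-s)  # invert the flow
        log_base = -0.5 * (u ** 2).sum(-1) \
                   - 0.5 * u.shape[-1] * torch.log(torch.tensor(2 * torch.pi))
        log_det = -s.sum(-1)          # log |du/dx_i|
        return log_base + log_det

def neg_log_likelihood(encoder, flows, x_parts):
    """x_parts: list of tensors, one per observable, each (batch, xi_dim)."""
    z, _ = encoder(torch.cat(x_parts, dim=-1))
    return -sum(flow.log_prob(xi, z) for flow, xi in zip(flows, x_parts)).mean()
```

Minimizing this negative log-likelihood over the data is equivalent to minimizing the KL divergence between the empirical distribution and the model’s conditionals, up to a constant.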
Trick I’m currently using: we can view the sum as taking an expectation under a uniform distribution. Under that uniform distribution, the quantity inside the expectation is a sum of independent random variables, so let’s wave our hands just a little and assume that sum is approximately normal.
Not following this part. Can you elaborate?
Some scattered thoughts:
Regarding convergence, to state the probably obvious: since the sum has to converge, its terms at least have to go to zero as the summation index goes to infinity.
In my field-theory-brained head, the analysis seems simpler to think about for continuous variables. So unless we’re married to them being discrete, I’d switch from sums to integrals. Then you can potentially use Gaussian-integral and source-term tricks with that dependency as well. If you haven’t already, you might want to look at (quantum) field theory textbooks that describe how to calculate expectation values of observables over path integrals. This expression looks extremely like the kind of thing you’d usually want to calculate with Feynman diagrams, except I’m not sure whether the terms have the right form to allow us to power-expand and then shove the non-quadratic terms into source derivatives the way we usually would in perturbative quantum field theory.
If all else fails, you can probably do it numerically, lattice-QFT style, using techniques like hybrid Monte Carlo to sample points in the integral efficiently (rough sketch of what I mean below).[1]
You can maybe also train a neural network to do the sampling.
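To make the hybrid Monte Carlo suggestion above concrete, here is a bare-bones sketch (NumPy; S, grad_S and the other names are placeholders for whatever the actual action and observable are, and a lattice-QFT-grade implementation would need step-size tuning, burn-in, and autocorrelation checks):

```python
import numpy as np

def hmc_sample(S, grad_S, x0, n_samples=1000, step=0.05, n_leapfrog=20, rng=None):
    """Sample points x with weight proportional to exp(-S(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x, samples = np.array(x0, dtype=float), []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)            # resample momenta
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration of Hamiltonian dynamics for H = S(x) + p^2/2.
        p_new -= 0.5 * step * grad_S(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new -= step * grad_S(x_new)
        x_new += step * p_new
        p_new -= 0.5 * step * grad_S(x_new)
        # Metropolis accept/reject corrects the integration error.
        dH = (S(x_new) + 0.5 * p_new @ p_new) - (S(x) + 0.5 * p @ p)
        if np.log(rng.uniform()) < -dH:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Toy check: estimate <x^2> under exp(-x^2/2); should come out near 1.
S = lambda x: 0.5 * float(x @ x)
grad_S = lambda x: x
samples = hmc_sample(S, grad_S, x0=np.zeros(2))
print((samples ** 2).mean())
```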
Skimming some of the posts in the sequence, I am not persuaded that corrigibility now looks like an engineering problem rather than a problem that needs (a) major theoretical breakthrough(s).
The point about corrigibility MIRI keeps making is that it’s anti-natural, and Max seems to agree with that.
Are there any theorems that use SLT to quantify out-of-distribution generalization?
There is one now, though whether you still want to count this as part of SLT or not is a matter of definition.
I’ve said this many times in conversations, but I don’t think I’ve ever written it out explicitly in public, so:
I support some form of global ban or pause on AGI/ASI development. I think the current AI R&D regime is completely insane, and if it continues as it is, we will probably create an unaligned superintelligence that kills everyone.
Yes, subtracting the appropriate term from inequality (1.1) does yield that. So, since the total KL divergence summed over the first n data points is bounded by the same constant for any n, and KL divergences are never negative, the per-step KL divergence must go to zero for large n fast enough for the sum to not diverge to infinity; loosely speaking, it has to decay faster than 1/n on average.
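For reference, the bound I have in mind is the standard Solomonoff/universal-prediction bound, written here in my own notation (which may not match the paper’s (1.1)): for a computable data-generating distribution $\mu$ with complexity $K(\mu)$ and the universal predictor $M$,
$$\sum_{n=1}^{N} \mathbb{E}_{x_{<n}\sim\mu}\!\left[D_{\mathrm{KL}}\!\big(\mu(\cdot \mid x_{<n}) \,\big\|\, M(\cdot \mid x_{<n})\big)\right] \;\le\; K(\mu)\ln 2 \qquad \text{for every } N,$$
so the partial sums are bounded by a constant that does not grow with $N$, which is what forces the summands toward zero.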
Though note that in real life, where the amount of data is finite, the per-step KL divergence can still go to zero very unevenly; it doesn’t have to be monotonic.
For example, the per-step KL-divergence might sit at zero for a long stretch and then suddenly show a small upward spike. A way this might happen is if the first batch of data points the inductor receives comes from one data distribution, and the subsequent data points are drawn from a very different distribution. If there is a program that is shorter than the true data-generating program (and hence gets more prior weight) and that can predict the data labels for the first distribution but not the second, whereas the true program can predict both, the inductor would favour the shorter program and assign it higher probability until it starts seeing data from the second distribution. It might make up to some bounded number of bits of prediction error early on, before its posterior becomes largely dominated by predictions that match the shorter program. After that, the KL-divergence would go to zero for a while, because everything is getting predicted accurately. Then, when we switch to the second data distribution, the KL-divergence would go up again for a while, until the inductor has added the remaining bits of prediction error to the total. From then on, the inductor would make predictions that match the true program, and so the KL-divergence would go back down to zero and this time stay at zero permanently.
I think a potential drawback of this strategy is that people tend to become more hesitant to argue with you. Their instincts tell them you’re a high-status person they can’t afford to offend or risk looking stupid in front of. If you seem less confident, less cool, and less high-status, the mental barrier for others to be disagreeable, share weird ideas, or voice confusion in your presence is lower.
I try to remember to show off some uncoolness and uncertainty for this reason, especially around more junior people. I used to have a big seal plushie on my desk in the office, partially because I just like cute stuffed animals, but also to try to signal that I am approachable and non-threatening and can be safely disagreed with.
I don’t think quantum immortality changes anything. You can rephrase this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus.
I agree that quantum mechanics is not really central for this on a philosophical level. You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you. It’s just that it seems at present very unclear and arguable whether the physical universe is in fact anywhere near that large, whereas I would claim that a universal wavefunction which constantly decoheres into different branches containing different versions of us is pretty strongly implied to be a thing by the laws of physics as we currently understand them.
However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are far greater in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility.
It is very late here and I should really sleep instead of discussing this, so I won’t be able to reply as in-depth as this probably merits. But, basically, I would claim that this is not the right way to do expected utility calculations when it comes to ensembles of identical or almost-identical minds.
A series of thought experiments might maybe help illustrate part of where my position comes from:
Imagine someone tells you that they will put you to sleep and then make two copies of you, identical down to the molecular level. They will place you in a room with blue walls. They will place one copy of you in a room with red walls, and the other copy in another room with blue walls. Then they will wake all three of you up.
What color do you anticipate seeing after you wake up, and with what probability?
I’d say 2⁄3 blue, 1⁄3 red. Because there will now be three versions of me, and until I look at the walls I won’t know which one I am.
Imagine someone tells you that they will put you to sleep and then make two copies of you. One copy will not include a brain. It’s just a dead body with an empty skull. Another copy will be identical to you down to the molecular level. Then they will place you in a room with blue walls, and the living copy in a room with red walls. Then they will wake you and the living copy up.
What color do you anticipate seeing after you wake up, and with what probability? Is there a 1⁄3 probability that you ‘die’ and don’t experience waking up because you might end up ‘being’ the corpse-copy?
I’d say 1⁄2 blue, 1⁄2 red, and there is clearly no probability of me ‘dying’ and not experiencing waking up. It’s just a bunch of biomass that happens to be shaped like me.
As 2, but instead of creating the corpse-copy without a brain, it is created fully intact, then its brain is destroyed while it is still unconscious. Should that change our anticipated experience? Do we now have a 1⁄3 chance of dying in the sense that we might not experience waking up? Is there some other relevant sense in which we die, even if it does not affect our anticipated experience?
I’d say no and no. This scenario is identical to 2 in terms of the relevant information processing that is actually occurring. The corpse-copy will have a brain, but it will never get to use it, so it won’t affect my anticipated experience in any way. Adding more dead copies doesn’t change my anticipated experience either. My best-scoring prediction will be that I have a 1⁄2 chance of waking up to see red walls, and a 1⁄2 chance of waking up to see blue walls.
In real life, if you die in the vast majority of branches caused by some event, i.e. that’s where the majority of the amplitude is, but you survive in some, the calculation for your anticipated experience would seem to not include the branches where you die for the same reason it doesn’t include the dead copies in thought experiments 2 and 3.
(I think Eliezer may have written about this somewhere as well using pretty similar arguments, maybe in the quantum physics sequence, but I can’t find it right now.)
I don’t think it proves too much. Informed decision-making comes in degrees, and some domains are just harder? Like, I think my threshold for leaving people free to make their own mistakes if they are the only ones harmed by them is very low, compared to where the human population average seems to be at the moment. But my threshold is, in fact, greater than zero.
For example, there are a bunch of things I think bystanders should generally prevent four year old human children from doing, even if the children insist that they want to do them. I know that stopping four year old children from doing these things will be detrimental in some cases, and that having such policies is degrading to the children’s agency. I remember what it was like being four years old and feeling miserable because of kindergarten teachers who controlled my day and thought they knew what was best for me. I still think the tradeoff is worth it on net in some cases.
I just think that the suicide thing happens to be a case where doing informed decision-making is maybe just too tough for way too many humans and thus some form of ban could plausibly be worth it on net. Sports betting is another case where I was eventually convinced that maybe a legal ban of some form could be worth it.
I think very very many people are not making an informed decision when they decide to commit suicide.
For example, I think quantum immortality is quite plausibly a thing. Very few people know about quantum immortality and even fewer have seriously thought about it. This means that almost everyone on the planet might have a very mistaken model of what suicide actually does to their anticipated experience.[1] Also, many people are religious and believe in a pleasant afterlife. Many people considering suicide are mentally ill in a way that compromises their decision making. Many people think transhumanism is impossible and won’t arrange for their brain to be frozen for that reason.
I agree that there is some threshold on the fraction of ill-considered suicides relative to total suicides such that suicide should be legal if we were below that threshold. I used to think we were maybe below that threshold. After I began studying physics at uni and so started taking quantum immortality more seriously, I switched to thinking we are maybe above the threshold.
You might find yourself in a branch where your suicide attempt failed, but a lot of your body and mind were still destroyed. If you keep exponentially decreasing the amplitude of your anticipated future experience in the universal wave function further, you might eventually find that it is now dominated by contributions from weird places and branches far-off in spacetime or configuration space that were formerly negligible, like aliens simulating you for some negotiation or other purpose.
I don’t really know yet how to reason well about what exactly the most likely observed outcome would be here. I do expect that by default, without understanding and careful engineering that our civilisation doesn’t remotely have the capability for yet, it’d tend to be very Not Good.
Assuming that the bits-to-parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
There are a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.
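As a toy illustration of the linearly-dependent-feature redundancy (NumPy; the setup and names are made up for this example): splitting one feature row into two parallel rows whose outputs are summed downstream leaves the network’s function unchanged for every value of the split, so the split parameter is a flat direction in parameter space even though no individual weight is unused.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)   # input
w = rng.standard_normal(5)   # a single feature direction
v = 2.0                      # downstream readout weight for that feature

def f_single(x):
    # One feature computing w @ x, read out with weight v.
    return v * (w @ x)

def f_split(x, alpha):
    # The same computation spread over two linearly dependent features.
    h = np.array([alpha * (w @ x), (1 - alpha) * (w @ x)])
    return np.array([v, v]) @ h

for alpha in (0.1, 0.5, 0.9):
    assert np.isclose(f_single(x), f_split(x, alpha))
print("output identical for every split alpha")
```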
If every bit of every weight were somehow used to store one bit of the program, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.[1]
I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
There’d be some extra trickiness in that, under these definitions, the parameters are technically real numbers and thus have infinitely many bits of storage capacity, though in real life they’re of course actually finite-precision floating point numbers.
The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
The trouble is that you need to store the programs in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
The problem is that this isn’t enough freedom. To get the term we want in the formula without an extra prefactor, we’d need the biases and weights for those neurons to be completely free, able to take any real value.
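A minimal numerical check of the dead-neuron freedom described above (NumPy, toy numbers): with a very negative bias the neuron never fires on bounded inputs, so its weights can vary without changing the network’s output, but only within the range that keeps the pre-activation negative, which is exactly the ‘not completely free’ problem.

```python
import numpy as np

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)  # one hidden ReLU layer

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=4)                # bounded input
W1 = rng.standard_normal((3, 4))
b1 = np.array([0.1, -0.2, -100.0])            # neuron 2 is "dead": bias -100
W2 = rng.standard_normal((1, 3))

baseline = mlp(x, W1, b1, W2)
for _ in range(5):
    W1_mod, W2_mod = W1.copy(), W2.copy()
    W1_mod[2] = rng.uniform(-10, 10, size=4)  # stays below the bias magnitude on |x| <= 1
    W2_mod[:, 2] = rng.standard_normal()      # output weight of a dead neuron is irrelevant
    assert np.allclose(baseline, mlp(x, W1_mod, b1, W2_mod))
print("output unchanged while the dead neuron's weights vary")
```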
EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
(Mediation)
Wait, so it’s enough for the agents to just believe the observables are independent given the state of their latents? We only need the observables to be independent conditional on the latent under a particular model?
I didn’t realise that. I thought the observables had to be ‘actually independent’ after conditioning in some sort of frequentist sense.
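To make sure I’m now parsing the condition correctly, writing it out in my own notation: the requirement is only that, under the agent’s own model $M$, the observables are conditionally independent given the latent,
$$P_M(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P_M(X_i \mid \Lambda),$$
not that they are independent under the true data-generating distribution after conditioning.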
Getting a version of this that works under approximate Agreement on Observables sounds like it would be very powerful then. It’d mean that even if Alice is much smarter than Bob, with her model e.g. having more FLOP which she can use to squeeze more bits of information out of the data, there’d still need to be a mapping between the concepts Bob and Alice internally use in those domains where Bob doesn’t do very much worse than Alice on predictive accuracy.
So, if a superintelligence isn’t that much better than humanity at modelling some specific part of reality, there’d need to be an approximate mapping between humanity’s latents and (some of) the superintelligence’s latents for that part of reality. If the theorems approximately hold under approximate agreement on observables.
The proposal at the end looks somewhat promising to me on a first skim. Are there known counterpoints for it?