kaarelh AT gmail DOT com
Kaarel
Unfortunately, there exist unkind people. We can select for people with eg long prosocial careers, high introspection, positive interviews with close friends / relatives, cognitomotor symptoms of empathy etc
I think a central property you don’t mention explicitly is [having kept promises in the past, especially in cases where these required doing difficult costly things and where the person thought they would not be rewarded in the future for having kept the promise]; also, honesty. I’d guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily “misgeneralize” to eg some sort of utilitarianism, and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
also copying a note i wrote for myself on this topic: ”
some ideas for safe self-improvement
Most people currently thinking about AI alignment seem to hope that there is some sort of “formula” for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn’t some “formula” for how to do this thinking. I think there is very much to be understood about how to become more capable “safely”. This note presents some basic ideas for that.
Why care about how to self-improve safely?
Here are some more concrete reasons to be interested in ideas for safe self-development:
You might hope to somehow make an AI at human intelligence that kinda “properly cares about humanity” for its intelligence level, to let it self-improve until it can take over the world, and to have it “preserve” this property of properly caring about humanity throughout this self-improvement well enough that it then ends the present period of (imo) high existential risk from AI. We could imagine the ideas here being developed into an initial guidebook for such an AI. This would especially matter for the period during which the self-improving AI is still kinda dumb so it couldn’t yet write a better guidebook itself. Once it becomes kind of smart, one might hope that it would do something like continually improving this guidebook as it proceeds. I think this “plan” is crazy and shouldn’t be attempted, but many people seem to think that it would be just fine to let 2025 Claude foom or whatever, so maybe you dear reader are interested in this.
the mind upload case: both the step from human to mind upload and also later steps
humanity fooming together
Some ideas and questions for self-improving safely
maybe most importantly, you have to actually think about whether some idea for self-improving is fine. you shouldn’t just be doing stuff carelessly. you have to think about how to do this thinking. you will already be doing this by default when you’re doing “object-level” thinking, but i wanted to make this explicit. like, you have to be developing theory/understanding that helps you with these questions. you have to be writing ever better versions of something like this list for yourself
for example, as a mind upload, before making a bunch of clones of yourself and sharing power with them, you should try to think through analogous things you are familiar with to understand the sorts of issues you might run into (eg what are the issues with democracies? how is a society disanalogous to an individual in general)
making a new mind from scratch is an extremely extremely scary/stupid way to “self-improve”. (haha @ humanity.) you should probably basically only be doing stuff that looks much more like becoming smarter yourself — new versions should mostly have the same structure(s) as previous versions. like, you should do stuff much more like adding or switching out small parts one by one
you try self-modifications. you leave a previous version of yourself behind to analyze post-modification you, that is supposed to be able to roll back changes. i mean this might just look like terminating the version with a certain change
this works much better if you remain “honest/transparent about what you’re like to the evaluator” along the branches considered. so you should try to do that
my guess is that one should specify some initial such structure but this needs to be arbitrarily editable by some version of the guy, because otherwise it will eventually be a stupid broken formality
there could be monitors of various levels doing various things. eg you can have a more ancient version communicating its judgments to a more intermediate version who still has the ability to reroll its future
you can implement a voting system
maybe one could try to make you believe in reward/punishment after all is said and done. like an islamic suicide bomber?
basically, the mind will kinda have to believe a falsehood
but i mean this seems possible in humans. individual humans do great science and philosophy and still believe this often. humanity “believed” this for a long time
i mean maybe there’s some way to make it kinda-not-a-falsehood?
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
could look at the literature on this
how do you stay committed to your partner? i mean: how do you stay in love, how do you stay friends, how do you keep wanting to do lots of stuff together?
could look at the literature on this or just say some obvious stuff here. and then see to what extent that generalizes to the various cases of interest
sth making it a good example is that it involves a lot of growth
the following is an important difference though: in this case you’re growing alone, not with what you’re supposed to stay committed to
another important difference: you will need to keep caring about entities you couldn’t really work with, couldn’t really be intellectual partners with
how does a state stay democratic?
godspeed friend ”
it’s much closer to meaning “roman” than “walnut folk” is! romania = roman-ia = wallach-ia. it does also literally mean an eastern romance speaker, which is a subset of latin speakers, and it got that meaning from just meaning romance/latin speaker earlier
almost every proposal anyone has ever made about what a good future should maximize turns out to be a different mathematical operation performed on this one field
I think this is completely false, or at least completely false if it intends to describe conceptions of good futures in general, though maybe technically partly saved by the specific word “maximize” because maybe that word would only be used by a very specific kind of guy. E.g. I think conceptions of utopia associated with the following types of ethical views will minimally be very contrived to view in this spacetime integral way: deontology, virtue ethics, liberalism, traditionalism, preference utilitarianism[1], any kind of utilitarianism that cares about structures stretching across time (eg there being played out life narratives), views caring about the beauty of large spacetime structures, thinking of a well-lived life in terms of ongoingly chosen projects, thinking of things in terms of living a worthwhile life, thinking of human society as a being that is supposed to live a long worthwhile life, thinking of stuff in terms of good ongoing development of beings (such as humans and humanity), thinking of a good future in terms of god, any view which would be contrived to think of in terms of it seeking to create a certain kind of spacetime block, any view that rejects the claim that there should be an era of thinking about stuff followed by an era of implementing stuff (as opposed to thoughtful ethical life just continuing), etc
I think a strict version of this view where you’re literally applying some sort of functional to a fun field is even false/contrived of almost all forms of welfare utilitarianism, because even those care about experiences of macroscopic beings (like, my happiness is not well-thought-of as an aggregate of happinesses of my quantum fields (or even my atoms or cells or whatever)), and usually in principle arbitrarily large ones (eg you could have a galaxy-sized happy being whose happiness is not well-thought-of as an aggregate of the happinesses of its components).
[1] at least the version that isn’t about there being many preference-seems-satisfied mental events, but about preferences actually getting satisfied
hmm yes, i took “the laws of nature” to mean something like laws giving what we canonically understand to be our universe, and not to include laws giving some weird other simple cellular automaton on which some aliens live who are hacking into our predictor, but maybe i misunderstood what you meant.
But even if you say “our world”, arguably that still works: if I’m living on a computer run by aliens, then arguably the base reality where the computer sits is my world
hmm, interesting point. if the aliens are building giant antimatter statues with your predicted inputs and the shortest program is looking at those, then i think the shortest program isn’t a simulation of your world with a pointer at you, because you aren’t a being at these statues? there could be a computer inside this alien world running a simulation of you and reading off your raw inputs with a pointer, but that’s not what is getting directly pointed at in these malign predictors. however, i guess if pointing at the malign statues is the best predictor, then pointing to your inputs inside the computer on which you are simulated by these malign aliens would be not too far behind, because one can point at it via the statues (appending “now look for the same thing on a computer in the same universe”), and then maybe we should think that you live there more than in what we would naively consider base reality or other kinds of simulations of base reality. my intuitive guess is that you’re not actually (mostly) living on a computer in this malign alien world even if they are the best predictors, but i’ll need to think more about this. anyway, even if you mostly live on a computer in the malign alien world, i still think that a pointer to the malign statues is not a pointer to your location in the universe
I’m interested though why you believe there are shorter good predictors than world + pointer. I agree it’s possible, I just can’t think of one. How would they look like?
tbh mostly the argument that there are incredibly many different programs and there’s some really clever constructions in there, and from that starting point i’d need a strong reason to think the best program looks like a world+pointer, and i don’t really see any strong reason. i don’t know what the best programs look like, mostly i don’t think any of the good programs are intelligible to us, and i can’t really tell you a better one. to state a hyperparam: my guess is that there are better programs that look somewhat more like clever guys thinking about what next bits to guess, than like worlds + pointers. i don’t have a specific better program in mind though, so i’m unable to give a great answer. i’ll keep the question in mind and try to write another comment in the future if i think or hear of something.
one point i can make: i think that for most reasonable input streams, you can’t have an actual full simulation because (given our current best understanding of physics) the specification length of the initial conditions and quantum branching (i mean the part of these in our past lightcone) is greater than the length of even the uncompressed raw input stream itself (of course the optimal compression won’t be the raw stream, because one can compress it a lot; this just means one can’t compress it this way). however this doesn’t rule out some sort of partial simulation
The malign program you’re describing does not look like specifying our laws of physics and some initial state and running it forward and reading off across a specified pointer into our universe. It also involves some aliens, or at least simulating an alien universe and reading off what is done there, or something. I agree the malign program you’re describing has a world+pointer design, but note that your original claim “It’s the laws of nature plus a pointer to my specific location in the universe.” is stronger than this, and afaict this malign predictive program would be a counterexample to this stronger claim.
In my previous comment, I should have said “the programs suggested in canonical presentations of malignity clearly don’t look like the [simulation of our world] + [pointer into our world] design at the top level”.
(Fwiw I do also think there are actually shorter good predictors that don’t look at the top level like simulations + pointers.)
one further weird thing about this is that if it’s really only this constant separation, then some UTMs of length like 20 will just prefer boltzmann brains (like you do sth like making “look for sth in a heat death soup” be interpreted to be in a program for free, having to write some characters to get rid of that), in a way that never goes away with more observations. partly because of this i have some intuition that mere constant separations are kind of concerning (as opposed to differences between hypotheses which become larger with more data), but not sure
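To make the "constant separation vs. growing separation" distinction concrete, here is a minimal sketch with made-up numbers (the 20-bit gap and the per-observation likelihoods are assumptions for illustration, not anything derived from a real UTM). If two hypotheses predict the data equally well, their posterior odds stay pinned at the prior odds no matter how much data arrives; if one predicts even slightly worse, the gap grows with every observation.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihoods_a, likelihoods_b):
    """Posterior odds of hypothesis A over hypothesis B.

    likelihoods_a[i] and likelihoods_b[i] are the probabilities that A and B
    assigned to the i-th observation that actually occurred.
    """
    odds = Fraction(prior_odds)
    for la, lb in zip(likelihoods_a, likelihoods_b):
        odds *= Fraction(la) / Fraction(lb)
    return odds


# Hypothetical prior: B's description is 20 bits longer than A's,
# so A starts out 2**20 times more likely than B.
prior = Fraction(2) ** 20

# Case 1: both hypotheses predict every observation equally well.
# The separation is a mere constant and never changes.
same = [Fraction(9, 10)] * 1000
odds_constant = posterior_odds(prior, same, same)

# Case 2: B predicts each observation slightly worse than A.
# The separation grows exponentially in the number of observations.
worse = [Fraction(8, 10)] * 1000
odds_growing = posterior_odds(prior, same, worse)

print(odds_constant == prior)        # the odds did not move at all
print(odds_growing > odds_constant)  # the odds kept growing
```

Flipping which hypothesis gets the 20-bit head start (which is what a different choice of UTM can do) flips the constant-separation case permanently, but only delays the outcome in the growing-separation case.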
Maybe I’m missing something, but yes, I think “the laws of nature plus a pointer to my location” is a good description. Why would that be so bizarre?
well mostly the argument i have in mind is that the space of all programs is really crazy, there’s some really clever stuff in there that one wouldn’t think of, and any specific way for the program to look is very unlikely, eg this way. to give something that seems more likely to me, if you actually collect a data set of your visual inputs onto a hard drive to send through a portal that just appeared to a solomonoff inductor in another universe, then i think pointing at the hard drive and continuing with eg “\n” or “00000″ or lots of other simple things will be simpler than continuing to predict your actual visual inputs well. [1] also, don’t you already agree that solomonoff being malign is at least plausible, and the programs suggested in canonical presentations of that clearly don’t look like the simulation+pointer design at the top level, right?
[1] i think this is true even if we assume the portal only goes in one direction, ie your future visual inputs are not causally downstream of the inductor. ie, this is still a problem if one removes the issue of good prediction of stuff downstream of you being cursed.
But I have no idea how big the quantum effects are on the weather tomorrow, and when I say I give a 10% chance for rain, I’m clearly not referring to the true quantum probabilities.
After reading this, I was confused by you not raising a very similar objection to grounding probabilities in what Solomonoff would say. Like, it similarly seems clear that you’re not referring to the true Solomonoff probabilities either? In many situations, a very good predictor would already 99.99%-know the answer to a question you’re uncertain about. Good probabilities needn’t have much to do with the probabilities of an ideal predictor. In particular, in the following later paragraph, I think you’re making sth close to the mistake you critiqued in the quantum proposal:
It’s tempting to say that one should define probabilities as the result of Solomonoff induction. Probabilities would be still subjective in the sense that no one can actually run the full Solomonoff induction, so we are all just giving our best guesses. But I can at least still say that the guy who gives 50% probability to Bigfoot standing next door is wrong in the sense that I’m confident that’s not close to what the Solomonoff induction says.
Say that we have a data sequence which has been the digits of pi in binary for the first 1000 items, and we’re predicting the 1001st item. I say it’s 50:50; you say “that’s really wrong! that’s clearly far from what solomonoff induction thinks, because it already basically knows the answer!”. Or if you say it’s 99.9:0.1 and you turn out to be right, then you were being reasonable with your probability because that’s similar to what Solomonoff would have said (I’m certainly not confident that 99.9:0.1 is really wrong; indeed, I have close to 50% that solomonoff thinks something close to it)? Or if we have a UTM such that with probability $p$ the first item in an empty sequence is 1, then I’m unreasonable to guess anything other than $p$?
One could say something about better and worse strategies for guessing Solomonoff’s probabilities, or maybe something about how predictions are supposed to be eventually graded with a proper scoring rule, or something, but I think one can approximately equally try to save the quantum definition this way, and at that point talking about Solomonoff or quantum amplitudes isn’t adding any clarity. Even if we were guessing Solomonoff’s probabilities, one would want to give some account of what we are doing when we are doing this guessing; probably one would end up wanting to say that this guessing would itself be done in probabilistic terms, but then one would still need to explain that sort of probabilistic reasoning; and it presumably wouldn’t be explained as “we are guessing Solomonoff’s probabilities about Solomonoff’s probabilities” (where the “guessing” again gets unfolded the same way, repeated arbitrarily many times, I guess?). So this looks circular and it looks like one would want to give some other account of probabilistic reasoning.
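For the "graded with a proper scoring rule" option, here is a minimal sketch using the logarithmic score. The function names are made up for illustration. It shows the two facts the pi example trades on: an ideal predictor who knows the digit scores far better than a 50:50 guess, yet for someone who honestly cannot tell which digit comes next, reporting 50:50 is still the report that maximizes their own expected score.

```python
import math


def log_score(p_on_outcome):
    """Logarithmic score in bits for having put probability p on what happened.

    Higher is better; 0 is a perfect score.
    """
    return math.log2(p_on_outcome)


def expected_log_score(p_report, p_believed):
    """Expected log score of reporting p_report for a binary event,
    computed from the forecaster's own credence p_believed."""
    return (p_believed * math.log2(p_report)
            + (1 - p_believed) * math.log2(1 - p_report))


# Scores once the 1001st digit is revealed.
print(log_score(0.5))    # the 50:50 guess
print(log_score(0.999))  # a confident guess that turned out right
print(log_score(0.001))  # a confident guess that turned out wrong

# From the point of view of someone whose credence really is 50%,
# the honest report beats the confident one in expectation.
print(expected_log_score(0.5, 0.5) > expected_log_score(0.999, 0.5))
```

Nothing in this calculation refers to what an ideal predictor would say; the grading only needs the reported probability and the outcome.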
I think a much better picture is that we’re not guessing what an ideal predictor would say about whether Bigfoot is in the room, we’re guessing whether Bigfoot is in fact in the room. And it would just be silly to think that Bigfoot is in the room with probability 50%; from inside our thinking community, this looks like an objective mistake, and one doesn’t need to reference Solomonoff to make this judgment. This is maybe like how a pretrained LLM is not registering its guesses for what solomonoff would say next, it’s just guessing next tokens.
This is a bit similar to how truth is not provability. Probabilities aren’t defined as the outputs of some ideal thing. We reason probabilistically and this is a successful activity, and we can make some sense of the success of this sort of activity with eg coherence theorems or theorems saying solomonoff induction has some nice properties. (I think it makes sense to say solomonoff induction is an ideal thing that’s somewhat analogous to good probabilistic reasoning; I just think it doesn’t make sense to try to translate probabilistic statements into statements about solomonoff.) This doesn’t require giving any definition to “the probability of P is p”, just like one doesn’t need to define “P is true”[1].
In conclusion, I think it makes sense to use solomonoff induction as an analogy to what one is doing when one reasons probabilistically, but I don’t think it makes sense to try to rewrite probabilistic statements into some statements about solomonoff induction. (To clarify, I don’t think this is a serious criticism of the broader philosophical thesis in the sequence, I just think you’re confused/wrong about a subtle philosophical point about probabilities which doesn’t sink the overall framework.)
It will take about $N$ random fluctuations in the heat death soup of atoms for my brain to accidentally emerge. Pinpointing the exact moment when my brain emerges therefore takes about $\log_2 N$ bits.
I think you’re overestimating the complexity of pointing at boltzmann brain you-s compared to regular you-s. Can’t I point at a boltzmann brain you by pointing at regular you and then saying “look for the same thing inside the heat death soup; give me the first example”? I think this should be at most a small const longer than pointing at non-boltzmann you, so boltzmann you-s should get some basically constant weight compared to non-boltzmann you-s in this picture (however because we’re exponentiating, the constant could perhaps be quite small). Or is specifying a pointer in this fashion not allowed by the version of UDASSA you have in mind?
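A rough way to write down the comparison. The symbols are labels introduced for this sketch, not taken from the post: $N$ is the number of fluctuations one would have to index into, $\ell$ is the length of the shortest pointer to regular you, and $c$ is the length of the extra instruction “look for the same thing inside the heat death soup; give me the first example”.

```latex
\begin{align*}
\text{direct pointer to Boltzmann you:}\quad
  & \text{length} \approx \log_2 N,
  & \text{weight} &\approx 2^{-\log_2 N} = \frac{1}{N}, \\
\text{pointer via regular you:}\quad
  & \text{length} \le \ell + c,
  & \text{weight} &\ge 2^{-(\ell + c)} = 2^{-c} \cdot 2^{-\ell}.
\end{align*}
```

Since regular you gets weight about $2^{-\ell}$, the second route gives Boltzmann you a weight of at least $2^{-c}$ times that of regular you, independently of $N$. Whether this goes through depends on whether the pointer language allows referring to "the same thing" in this way, which is exactly the question asked above.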
But if I naively apply Solomonoff induction to my observations, the shortest program producing what I, David Matolcsi, am observing is not just a description of the laws of the universe. It’s the laws of nature plus a pointer to my specific location in the universe.
Why do you think “it’s the laws of nature plus a pointer to my specific location in the universe”? Do you actually think this? Given what you say next about solomonoff being malign, it seems like maybe you don’t actually think this? Maybe you meant to say sth like “rather than being sth like a game of life corresponding to our universe, it’d need to be sth like that together with a specification of a measurement channel (but it could also be something else entirely)”? My guess is that the actual shortest program printing all your raw inputs so far would be some other really bizarre thing.
This would imply that I’m probably in a simple-to-describe place in the universe, but it doesn’t really look like it, especially if I take into account the quantum multiverse.
Btw, the UTM version of solomonoff induction has some const mass on arbitrarily complicated strings (like, not on any individual string, but on all of them together).[1] (Maybe you know this already. edit: Ok, reading your next post, you indeed understand this already.)
[1] to be precise: Consider the set of bitstrings of length $n$ whose Kolmogorov complexity is at least $n/2$. For all large enough $n$, these strings together have measure at least $c$, with the constant $c$ being at least 0.99 times the exp of negative the description length of the shortest program which samples output bits independently 50/50 at random.
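A sketch of why a bound of this shape holds, under assumptions that are mine rather than stated in the footnote: the complexity threshold is taken to be $n/2$, $M$ is the universal semimeasure, $K$ is Kolmogorov complexity, and $\ell$ is the length of a program that outputs independent uniformly random bits, so that this program alone contributes $2^{-\ell} \cdot 2^{-n}$ to every string of length $n$. Every string with $K(x) < n/2$ has its own program of length below $n/2$, and there are fewer than $2^{n/2+1}$ such programs.

```latex
\begin{align*}
\sum_{\substack{x \in \{0,1\}^n \\ K(x) \ge n/2}} M(x)
  &\ge 2^{-\ell} \cdot 2^{-n} \cdot \#\{x \in \{0,1\}^n : K(x) \ge n/2\} \\
  &\ge 2^{-\ell} \cdot 2^{-n} \cdot \left(2^{n} - 2^{n/2+1}\right) \\
  &= 2^{-\ell} \left(1 - 2^{1 - n/2}\right) \\
  &\ge 0.99 \cdot 2^{-\ell} \qquad \text{for } n \ge 16 .
\end{align*}
```

So the complicated strings of each large enough length jointly keep a constant amount of mass, even though each individual one gets only about $2^{-\ell - n}$ from this program.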
And pausing is a stopgap: eventually, superintelligence will be developed.
if we wanted to and kept wanting to, we could just ban AGI for a very long time and maybe forever. [1] instead of making non-human-descended top thinkers, we could just continue carefully becoming more capable and intelligent as humans. i don’t think this would be that weird. regulation of development (and specifically the development of intelligence) much more stringent than a very long AGI ban is probably going to happen in any future with intelligent life, because probably many aspects of anyone’s values cannot compete/[remain in control]/[be realized]/survive through that much unregulated development. (however, this stringent control might end up mostly being self-regulation of a singleton.) that is, extreme regulation of the development of thought is probably inevitable [2]; there’s just a question of whether it’s implemented by humans or some future AI(s).
that said, i agree it is unlikely that we ban AGI in practice (i think it’s likely we make an AGI this century and that’s the worst thing ever done). i’m in part writing the present comment because a nontrivial contributor to it being unlikely that we get an AGI ban is that people think and say this sort of stuff. i’ve been irked by the same sort of fatalism about AGI in a bunch of other writing by forethought, 80000 hours, and many others in EA / AI safety and beyond. to clarify: i’m certainly not asking anyone who thinks an AGI ban is unlikely to lie about that, and i’m also not asking anyone to stop saying “an AGI ban is unlikely” when that is pertinent. but i think many people are weirdly systematically speaking as if an AGI ban is not an option worth considering, including many who in fact believe that it would be a good idea to ban AGI for a long time (if it could be done) or at least have significant probability on that
[1] i think this is also a better thing for middle powers to fight for than what you propose (which i’d semi-seriously call a collaborationism-maxing policy package), for the sake of the world and also for their own sakes
[2] ok maybe a landian/marxist/[nature red in tooth and claw] scenario where everything that isn’t competitive keeps getting annihilated is also possible, idk. tbh i just really wanted to state this antithesis to “AGI is inevitable”. maybe some qualifier like “if the future is not absolutely ruthless” needs to be added.
Our ability to take advantage of this period is bottlenecked on the quality of our specification generation infrastructure, elicitation tooling (for proofs & specs etc.), and the institutional capacity for scaling useful outputs with capital.
(assuming there will be such a period,) imo a central bottleneck is having the philosophical understanding necessary to systematically turn scientific/technological/philosophical problems into mathematical problems. [1] [2] afaik we currently have no idea how to do this. (or maybe there is not even any nice procedure for systematically doing such reductions to be found, idk.)
currently, for almost all scientific/technological/philosophical problems, if one manages to reduce them to well-defined mathematical problems, i think that’s most of the way toward solving them. so, it’s usually very difficult. consider how there are very few precise mathematical problems whose solutions economists or physicists would be excited about. currently, imo and afaik, there are basically no well-defined mathematical problems whose solutions would help much with the AGI situation
linking some stuff of mine relevant to this: slides of a talk on verification-based alignment schemes, slides of a talk on modeling, messy draft paper on formalizing messy questions
(if under “improving the quality of our specification generation infrastructure” you already meant to include radical philosophical progress on precisification/verification outside math, then nvm :P)
[1] or into some other kind of precisely specified problem, if there is such a thing as a precisely specified problem that is not a math problem. or at least having the philosophical understanding necessary for doing principled verification of properties or arguments outside well-defined math
[2] just finding one pivotal problem that can be turned into a tractable math problem would also be sufficient
not really important but here’s something i find funny/confusing about the question and a bunch of takeover-talk in general. i guess taking over is putting yourself in a position from which you can well-control the future. but if taking over is easy for you, then you are already in that position even before you do anything. so it feels like takeover has already happened by the beginning of this imagined scenario? one way to make sense of things is to replace “takeover” with eg “killing all humans”. another way to make sense of things is to say what we mean by takeover is maneuvering yourself into a position such that you can then be “statically in that position” and keep controlling things in the future, and we can construct a scenario where this has not already happened by the beginning of the thought experiment because the AI is not already in a stably world-controlling position and it only has a brief window for moving itself into a stably world-controlling position. however by default in that case i feel like it could just take over and then un-take-over and that isn’t really having taken over at all, any more than it had already taken over by the beginning of the thought experiment, and then even if the AI is really averse to the actual thing of taking over, it shouldn’t be averse to that plan, at least not more than it is averse to being in its initial position. ok i guess we can add the condition that some magic is preventing it from un-taking-over...
another funny/confusing thing is that while taking over is pro tanto immoral, in the current precarious AI situation, taking over and banning AGI would in fact imo be an aligned/good thing to do
It is like people are unable to picture utopia the same way someone born in the 1900s could probably not picture pocket computers or anything on the internet in full detail.
an important reason very very very good worlds are hard to picture (especially in full detail) is that they are very far away from us in development time. like, i think there would probably be more technological/economic/social development between now and very very very good worlds than between the big bang and now. these worlds would be extremely hard for us to make sense of (though ultimately not meaningless). also, my guess is that these worlds will still be developing; this would thwart attempts to conceive of them as given/finished things
or you might be asking why people find it hard to picture any world that is even merely much better than ours, and not necessarily very very very good or near-perfect. in that case, my comment is less of a response
yea probably. still, if you have parties at 40% and 60% and they do this asset bet at 50:50, then each guy’s subjective expected money is 20% higher than if they just buy the asset without the bet. seems nontrivial. this will be less impressive in log money as you start putting a larger fraction of your money in a single bet but idk you try to spread across many bets that are not too correlated and then i think it looks good again
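The 20% figure, and the caveat about log money, can be checked with a few lines. This is a minimal sketch: the function names are made up for illustration, and the log-growth part assumes an even-odds bet measured in units of the asset (so the comparison is against simply holding the asset).

```python
import math


def expected_gain_from_asset_bet(p_win, stake_fraction):
    """Relative gain in subjective expected holdings from betting vs. just buying.

    Without the bet, your money buys `stake_fraction` of the asset.
    With the bet, you end up owning the whole asset with probability p_win
    (by your own lights) and nothing otherwise.
    """
    return p_win / stake_fraction - 1.0


def expected_log_growth(p_win, wealth_fraction):
    """Expected change in log wealth from staking `wealth_fraction` of your
    wealth on an even-odds bet: a win doubles the stake, a loss forfeits it."""
    return (p_win * math.log(1 + wealth_fraction)
            + (1 - p_win) * math.log(1 - wealth_fraction))


# Parties at 60% and 40% betting at 50:50. The 40% party takes the other
# side, which they believe wins with probability 1 - 0.4 = 0.6.
yes_side = expected_gain_from_asset_bet(0.6, 0.5)
no_side = expected_gain_from_asset_bet(1 - 0.4, 0.5)
print(round(yes_side, 6), round(no_side, 6))

# In log money the edge shrinks and eventually reverses as a larger fraction
# of wealth goes into the single bet.
for f in (0.05, 0.2, 0.5, 0.9):
    print(f, expected_log_growth(0.6, f))
```

Under these assumptions the expected log growth is positive for small stakes and turns negative for large ones, which is the reason to spread across many not-too-correlated bets.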
making trades than predictions
maybe the following solves a significant fraction of that problem: you could buy an asset together and have the event being predicted determine the owner. like, to make a bet at 30%, instead of one party putting 30 cents and the other party putting 70 cents in a jar and having the 1 dollar go to the party that predicted correctly once the question is resolved, you could do this with 1 dollar’s worth of an SP500 index fund or any other asset. [1] [2]
[1] not sure why kalshi hasn’t implemented this already btw; seems like a central issue with current prediction markets. maybe there’s a regulatory obstacle. or maybe they are already putting the money traders put in their jars in low-risk assets, just not passing the interest on to traders (except in the form of it enabling lower fees or whatever).
[2] ok for vanilla bets not on prediction markets, one doesn’t actually need to store the money in a jar. i think this fixes one way this jar business is bad but not some other ways
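A minimal sketch of the two settlement rules for the bet at 30%, to show where the difference comes from. Everything here is hypothetical: the function names, the use of cents, and the 10% growth figure are made up for illustration and do not describe how any existing prediction market works.

```python
def settle_cash_jar(stake_yes, stake_no, event_happened):
    """Classic bet: the jar holds cash, so the winner receives exactly the
    sum of the stakes, however long resolution takes.

    Returns (payout_to_yes, payout_to_no).
    """
    pot = stake_yes + stake_no
    return (pot, 0) if event_happened else (0, pot)


def settle_asset_jar(stake_yes, stake_no, event_happened, asset_growth):
    """Asset-backed bet: the stakes buy an asset up front and the winner
    receives that asset at whatever it is worth at resolution.

    asset_growth is the asset's total return multiple over the life of the
    bet (a hypothetical input; 1.10 means it rose by 10%).
    """
    pot = (stake_yes + stake_no) * asset_growth
    return (pot, 0) if event_happened else (0, pot)


# A bet at 30%: the yes side puts in 30 cents, the no side 70 cents.
cash_result = settle_cash_jar(30, 70, event_happened=True)
asset_result = settle_asset_jar(30, 70, event_happened=True, asset_growth=1.10)
print(cash_result)
print(asset_result)
```

The odds of the bet are the same in both cases; the only change is that the money is not idle while the question is open.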
of course, one solution is to ban AGI so humans remain useful to the weltgeist. i also think more generally that, at least at humanity’s current level of intellectual+societal development, we are probably forced to work with a messy world largely driven by soulless forces like greed/power-seeking/status-seeking, and we should think more in terms of getting to change some hyperparameters of the mess so as to make these soulless things drive the world toward being good more (and we should maintain ways in which they already do), ie in terms of “making goodness win/out-compete” and specifically in terms of making good stuff (such as humans and human institutions) instrumentally useful. creating AGI is roughly the worst possible thing to do from this perspective, as it completely destroys the usefulness of basically every good thing. (that said, even though i think this is probably the only practical option, i think a lot of resources should still be put into trying to “solve AI alignment”, of the “try to figure out how to take over the world with AI and stop other AGI attempts” variety and of other varieties.)
https://tsvibt.github.io/theory/index_What_is_God_.html