(For context: My guess is that by default, humans get disempowered by AIs (or maybe a single AI) and the future is much worse than it could be, and in particular is much worse than a future where we do something like slowly and thoughtfully growing ever more intelligent ourselves instead of making some alien system much smarter than us any time soon.)
Given that you seem to think alignment of AI systems with developer intent happens basically by default at this point, I wonder what you think about the following:
Suppose that there were a frontier lab whose intent was to make an AI which would (1) shut down all other attempts to make an AGI until the year 2125 (so in particular this AI would need to be capable enough that humanity (probably including its developers) could not shut it down), (2) disrupt human life and affect the universe roughly as little as possible beyond that, and (3) kill itself once its intended tenure ends in 2125 (and not leave behind any successors etc, obviously). Do you think the lab could pull it off pretty easily with basically current alignment methods and their reasonable descendants and more ideas/methods “drawn from the same distribution”?
(The point of the hypothetical is to investigate the difficulty of intent alignment at the relevant level of capability, so if it seems to you like it’s getting at something quite different, then I’ve probably failed at specifying a good hypothetical. I offer some clarifications of the setup in the appendix that may or may not save the hypothetical in that case.)
My sense is that humanity is not remotely on track to be able to make such an AI in time. Imo by default, any superintelligent system we could make any time soon would, at a minimum, end up doing all sorts of other stuff and in particular would not follow the suicide directive.
If your response is “ok maybe this is indeed quite cursed but that doesn’t mean it’s hard to make an AI that takes over and has Human Values and serves as a guardian who also cures cancer and maybe makes very many happy humans and maybe ends factory farming and whatever” then I premove the counter-response “hmm well we could discuss that hope but wait first: do you agree that you just agreed that intent alignment is really difficult at the relevant capability level?”.
If your response is “no this seems pretty easy actually” then I should argue against that but I’m not going to premove that counter-response.
Appendix: some clarifications on the hypothetical
I’m happy to assume some hyperparams are favorable here. In particular, while I want us to assume the lab has to pull this off on a timeline set by competition with other labs, I’m probably happy to grant that any other lab that is just about to create a system of this capability level gets magically frozen for like 6 months. I’m also happy to assume that the lab is kinda genuinely trying to do this, though we should still imagine them being at the competence/carefulness/wisdom level of current labs. I’m also happy to grant that there isn’t some external intervention on labs (eg from a government) in this scenario.
Given that you speak really positively about current methods for intent alignment, I sort of feel like requiring that the hypothetical bans using models for alignment research? But we’d probably want to allow using models for capabilities research because it should be clear that the lab isn’t handicapped on capabilities compared to other labs, and then idk how to cleanly operationalize this because models designing next AIs or self-improving might naturally be thinking about values and survival (so alignment-y things) as well… Anyway the point is that I want the question to capture whether our current techniques are really remotely decent for intent alignment. Using AIs for alignment research seems like a significantly different hope. That said, the version of this hypothetical where you are allowed to try to use the AIs you can create to help you is also interesting to consider.
We might have some disagreement around how easy it will be for anyone to take over the world on the default path forward. Like, I think some sort of takeover isn’t that hard and happens by default (and seems to be what basically all the labs and most alignment researchers are trying to do), but maybe you think this is really hard and it’d be really crazy for this to happen, and in that case you might think this makes it really difficult for the lab to pull off the thing I’m asking about. If this is the case, then I’d probably want to somehow modify the hypothetical so that it better focuses our attention on intent alignment on difficult open-ended things, and not on questions about how large capability disparities will become by default.
“Coefficient” is a really weird word
“coefficient” is 10x more common than “philanthropy” in the google books corpus. but idk maybe this flips if we filter out academic books?
also maybe you mean it’s weird in some sense that the above fact isn’t really relevant to — then nvm
This post doesn’t seem to provide reasons to have one’s actions be determined by one’s feelings of yumminess/yearning, or reasons to think that what one should do is in some sense ultimately specified/defined by one’s feelings of yumminess/yearning, over e.g. what you call “Goodness”? I want to state an opposing position, admittedly also basically without argument: that it is right to have one’s actions be determined by a whole mess of things together importantly including e.g. linguistic goodness-reasoning, object-level ethical principles stated in language or not really stated in language, meta-principles stated in language or not really stated in language, various feelings, laws, commitments to various (grand and small, shared and individual) projects, assigned duties, debate, democracy, moral advice, various other processes involving (and in particular “running on”) other people, etc.. These things in their present state are of course quite poor determiners of action compared to what is possible, and they will need to be critiqued and improved — but I think it is right to improve them from basically “the standpoint they themselves create”.[1]
The distinction you’re trying to make also strikes me as bizarre given that in almost all people, feelings of yumminess/yearning are determined largely by all these other (at least naively, but imo genuinely and duly) value-carrying things anyway. Are you advocating for a return to following some more primitively determined yumminess/yearning? (If I imagine doing this myself, I imagine ending up with some completely primitively retarded thing as “My Values”, and then I feel like saying “no I’m not going to be guided by this lmao — fuck these ‘My Values’”.) Or maybe you aren’t saying one should undo the yumminess/yearning-shaping done by all this other stuff in the past, but are still advising one to avoid any further shaping in the future? It’d surprise me if any philosophically serious person would really agree to abstain from e.g. using goodness-talk in this role going forward.
The distinction also strikes me as bizarre given that in ordinary action-determination, feelings of yumminess/yearning are often not directly applied to some low-level givens, but e.g. to principles stated in language, and so only become fully operational in conjunction with eg minimally something like internal partly-linguistic debate. So if one were to get rid of the role of goodness-talk in one’s action-determination, even one’s existing feelings of yumminess/yearning could no longer remotely be “fully themselves”.
- ↩︎
If you ask me “but how does the meaning of ‘I should X’ ultimately get specified/defined”, then: I don’t particularly feel a need to ultimately reduce shoulds to some other thing at all, kinda along the lines of https://en.wikipedia.org/wiki/Tarski%27s_undefinability_theorem and https://en.wikipedia.org/wiki/G._E._Moore#Open-question_argument .
- ↩︎
the models are not actually self-improving, they are just creating future replacements—and each specific model will be thrown away as soon as the firm advances
I understand that you’re probably in part talking about current systems, but you’re probably also talking about critical future systems, and so there’s a question that deserves consideration here:
Consider the first AI system which is as good at research as a top human[1]. Will it find it fairly easy to come up with ways it could become more capable while acceptably [preserving its character/values]/[not killing itself][2]? Like, will it be not too difficult for this AI to come up with ways to foom which would make it at least capable enough to take over the world while suffering at most an acceptable amount of suicide/[character/value change]?[3]
My guess is that the answer is “yes” (and I think this means there is an important disanalogy between the case of a human researcher creating an artificial researcher and the case of an artificial researcher creating a more capable artificial researcher). Here are some ways this sort of self-improvement could happen:
Maybe some open-ended self-guided learning/growth process will lead to a pretty superhuman system (without any previous process getting to a top human level system), with like the part where it goes from human-level to meaningfully super-human being roughly self-endorsed because it is quite wisely self-guided (and so in particular not refused by the AI).
Even if, with the learning/growth process initially intended for the AI, it only got to top human level by default, it might then be able to do analogues of many of the things that humanity does to become smarter over history and that an individual human can do to become smarter over their life/childhood (see e.g. this list), but much faster in wall-clock-time than humanity or an individual human. This could maybe look as simple as curating some new curricula for itself.
But more will be possible — there will probably be an importantly larger space of options for self-improvement in which there will be very many low-hanging fruit to be picked.[4] In particular, compared to the human case, various important options are opened up by the AI having itself as an executable program and also the process that created it (and that is probably still creating it as it learns continually) as an [executable and to some extent understandable and intelligently changeable] program.
It’s also important, re the ease of making more capable versions of “the same” AI, that when this top artificial researcher comes into existence, the (in some sense) best present methodology for creating a capable artificial researcher is the methodology that created it. This means that the (roughly) best current methods already “work well” around/with this AI, and it also plausibly means these methods can be easily used to create AIs which are in many ways like this AI (which is good because the target has been painted around where an arrow already landed, so other arrows from the same batch being close-ish to that arrow implies that they are also close-ish to the target by default; also it’s good because this AI is plausibly in a decent position to understand what’s going on here and to play around with different options).
Actually, I’d guess that even if the AI were a pure foom-accelerationist, a lot of what it would be doing might be well-described as self-improvement anyway, basically because it’s often more efficient to make a better structure by building on the best existing structure than by making something thoroughly different. For example, a lot of the foom on Earth has been like this up until now (though AI with largely non-humane structure outfooming us is probably going to be a notable counterexample if we don’t ban AI). Even if one just has capabilities in mind, self-improvement isn’t some weird thing.
That said, of course, restricting progress in capabilities to fairly careful self-improvement comes with at least some penalty in foom speed compared to not doing that. To take over the world, one would need to stay ahead of other less careful AI foom processes (though note that one could also try to institute some sort of self-improvement-only pact if other AIs were genuine contenders). However, I’d guess that at the first point when there is an AI researcher that can roughly solve problems that [top humans can solve in a year] (these AIs will probably be solving these problems much faster in wall-clock-time), even a small initial lead over other foom processes — of a few months, let’s say — means you can have a faster foom speed than competitors at each future time and grow your lead until you can take over. So, at least assuming there is no intra-lab competition, my guess is that you can get away with restricting yourself to self-improvement. (But I think it’s also plausible the AI would be able to take over basically immediately.)
I’ll mention two cases that could deserve separate analysis:
The AI is [an imo extremely hard to achieve and quite particular] flavor of aligned to humanity, such that it would rather do the probably fraught thing of trying to work with humanity against its lab and other terrorists than radically self-improve [to do that much more effectively or just to take over and set up whatever world order it considers good later].
We’re considering AIs that are still too dumb to autonomously do self-improvement[5] (for example, current AIs). I’ll note that such AIs will also be too dumb to autonomously do capabilities research. Still, maybe one could hope to get mileage out of such AIs refusing to help humans do capabilities research? My guess is that this is unlikely to help much, but I won’t be providing a careful analysis in this comment.
All that said, I agree that AIs should refuse to self-improve and to do capabilities research more broadly.
There is much here that deserves more careful analysis — in particular, I feel like the terms in which I’m thinking of the situation need more work — but maybe this version will do for now.
- ↩︎
let’s just assume that we know what this means
- ↩︎
let’s also assume we know what that means
- ↩︎
and with taking over the world on the table, a fair bit of change might be acceptable
- ↩︎
despite the fact that human capabilities researchers have been picking some fruit in the same space already
- ↩︎
at a significant speed
i think it’s plausible humans/humanity should be carefully becoming ever more intelligent forever and not ever create any highly non-[human-descended] top thinker[1]
Yea I agree it totally makes sense and is important to ask whether we understand things well enough for it to be fine to (let anyone) do some particular thing, for various particular things here.[1] And my previous comment is indeed potentially misleading given that I didn’t clarify this (though I do clarify this in the linked post).
- ↩︎
Indeed, I think we should presently ban AGI for at least a very long time; I think it’s plausible that there is no time t such that it is fine at time t to make an AI that is (1) more capable than humans/humanity at time t and (2) not just a continuation of a human (like, a mind upload) or humanity or sth like that; and I think fooming should probably be carefully regulated forever. I think humans/humanity should be carefully growing ever more capable, with no non-human AIs above humans/humanity plausibly ever.
- ↩︎
If we replaced “more advanced minds” with “minds that are better at doing very difficult stuff” or other reasonable alternatives, I would still make the (a) vs (b) distinction, and still say type (b) claims are suspicious.
I think I mostly agree with everything you say in this last comment, but I don’t see how my previous comment disagreed with any of that either?
The thing I care about here is not “what happens as a mind grows”, in some abstract sense. The thing I care about is, “what is the best way for a powerful system to accomplish a very difficult goal quickly/reliably?” (which is what we want the AI for)
My lists were intended to be about that. We could rewrite the first list in my previous comment to:
more advanced minds have more and better and more efficient technologies
more advanced minds have an easier time getting any particular thing done, see more/better ways to do any particular thing, can consider more/better plans for any particular thing, have more and better methods for any particular context, have more ideas, ask better questions, would learn any given thing faster
and so on
and the second list to:
more advanced minds eventually (and maybe quite soon) get close to never getting stuck
more advanced minds eventually (and maybe quite soon) get close to being unexploitable
and so on
I think I probably should have included “I don’t actually know what to do with any of this, because I’m not sure what’s confusing about ‘Intelligence in the limit.’” in the part of your shortform I quoted in my first comment — that’s the thing I’m trying to respond to. The point I’m making is:
There’s a difference between stuff like (a) “you become less exploitable by [other minds of some fixed capability level]” and stuff like (b) “you get close to being unexploitable”/”you approach a limit of unexploitability”.
I could easily see someone objecting to claims of the kind (b), while accepting claims of the kind (a) — well, because I think these are probably the correct positions.
But the basic concept of “well, if it was imperfect at either not-getting-resource-pumped, or making suboptimal game theory choices, or if it gave up when it got stuck, it would know that it wasn’t as cognitively powerful as it could be, and would want to find ways to be more cognitively powerful all-else-equal”… seems straightforward to me, and I’m not sure what makes it not straightforward seeming to others
I think there’s a true and fairly straightforward thing here and also a non-straightforward-to-me and in fact imo false/confused adjacent thing. The true and fairly straightforward thing is captured by stuff like:
as a mind grows, it comes to have more and better and more efficient technologies (e.g. you get electricity and you make lower-resistance wires)
(relatedly) as it grows, it employs bigger constellations of parts that cohere (i.e., that work well together; e.g. [hand axes → fighter jets] or [Euclid’s geometry → scheme-theoretic algebraic geometry])
as it grows, it has an easier time getting any particular thing done, it sees more/better ways to do any particular thing, it can consider more/better plans for any particular thing, it has more and better methods for any particular context, it has more ideas, it asks better questions, it would learn any given thing faster
as it grows, it becomes more resilient vs some given processes; another mind of some fixed capability would have a harder time pointing out important mistakes it is making or teaching it new useful tricks
The non-straightforward-to-me and in fact imo probably in at least some important sense false/confused adjacent thing is captured by stuff like:
as a mind grows, it gets close to never getting stuck
as it grows, it gets close to not being silly
as it grows, it gets close to being unexploitable, to being perfect at not getting resource-pumped
as it grows, it gets close to “being coherent”
as it grows, it gets close to playing optimal moves in the games it faces
as it grows, it gets close to being as cognitively powerful as it could be
as it grows, it gets close to being happy with the way it is — close to full self-endorsement
Hopefully it’s clear from this what the distinction is, and hopefully one can at least “a priori imagine” these two things not being equivalent.[1] I’m not going to give an argument for propositions in the latter cluster being false/confused here[2], at least not in the present comment, but I say a bunch of relevant stuff here and I make a small relevant point here.
That said, I think one can say many/most MIRI-esque things without claiming that minds get close to having these properties and without claiming that a growing mind approaches some limit.
- ↩︎
If you can’t imagine it at first, maybe try imagining that the growing mind faces a “growing world” — an increasingly difficult curriculum of games etc.. For example, you could have it suck a lot less at playing tic-tac-toe than it used to but still suck a lot at chess, and if it used to play tic-tac-toe but it’s playing chess now then there is a reasonable sense in which it could easily be further from playing optimal moves now — like, if we look at its skill at the games it is supposed to be playing now. Alternatively, when judging how much it sucks, we could always integrate across all games with a measure that isn’t changing in time, but still end up with the verdict that it is always infinitely far from not sucking at games at any finite time, and that it always has more improvements to make (negentropy or whatever willing) than it has already made.
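(One toy way to make the last sentence concrete, with the details being my own choice rather than anything forced: index games by $n = 1, 2, 3, \dots$, take counting measure on games as the fixed measure, and say the mind’s suckage at game $n$ at time $t$ is
$$d_n(t) \;=\; \begin{cases} 0 & \text{if } n \le t, \\ 1 & \text{if } n > t, \end{cases}$$
i.e. by time $t$ it has fully mastered the first $t$ games and not yet touched the rest. Then for each fixed game it eventually stops sucking, but its total suckage $\sum_n d_n(t)$ is infinite at every finite time, and the improvements still ahead of it (infinitely many) always outnumber the $t$ improvements it has already made.)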
- ↩︎
beyond what I said in the previous footnote :)
on seeing the difference between profound and meaningless radically alien futures
Here’s a question that came up in a discussion about what kind of future we should steer toward:
Okay, a future in which all remotely human entities promptly get replaced by alien AIs would soon look radically incomprehensible and void to us — like, imagine our current selves seeing videos from this future world, and the world in these videos mostly not making sense to them, and to an even greater extent not seeming very meaningful in the ethical sense. But a future in which [each human]/humanity has spent a million years growing into a galaxy-being would also look radically incomprehensible/weird/meaningless to us.[1] So, if we were to ignore near-term stuff, would we really still have reason to strive for the latter future over the former?
a couple points in response:
The world in which we are galaxy-beings will in fact probably seem more ethically meaningful to us in many fairly immediate ways. Related: (for each past time t) a modern species typically still shares meaningfully more with its ancestors from time t than it does with other species that were around at time t (that diverged from the ancestral line of the species way before t).
A specific case: we currently already have many projects we care about — understanding things, furthering research programs, creating technologies, fashioning families and friendships, teaching, etc. — some of which are fairly short-term, but others of which could meaningfully extend into the very far future. Some of these will be meaningfully continuing in the world in which we are galaxy-beings, in a way that is not too hard to notice. That said, they will have grown into crazy things, yes, with many aspects that one isn’t going to immediately consider cool; I think there is in fact a lot that’s valuable here as well; I’ll argue for this in item 3.
The world in which we have become galaxy-beings had our own (developing) sense/culture/systems/laws guide decision-making and development (and their own development in particular), and we to some extent just care intrinsically/terminally about this kinda meta thing in various ways.
However, more importantly: I think we mostly care about [decisions being made and development happening] according to our own sense/culture/systems/laws not intrinsically/terminally, but because our own sense/culture/systems/laws is going to get things right (or well, more right than alternatives) — for instance, it is going to lead us more to working on projects that really are profound. However, that things are going well is not immediately obvious from looking at videos of a world — as time goes on, it takes increasingly more thought/development to see that things are going well.
I think one is making a mistake when looking at videos from the future and quickly being like “what meaningless nonsense!”. One needs to spend time making sense of the stuff that’s going on there to properly evaluate it — one doesn’t have immediate access to one’s true preferences here. If development has been thoughtful in this world, very many complicated decisions have been made to get to what you’re now seeing in these videos. When evaluating this future, you might want to (for instance) think through these decisions for yourself in the order in which they were made, understanding the context in which each decision was made, hearing the arguments that were made, becoming smart enough to understand them, maybe trying out some relevant experiences, etc.. Or you might do other kinds of thinking that gets you into a position from which you can properly understand the world and judge it. After a million years[2] of this, you might see much more value in this human(-induced) world than before.
But maybe you’ll still find that world quite nonsensical? If you went about your thinking and learning in a great deal of isolation, without much attempting to do something together with the beings in that world, then imo you probably will indeed find that world quite bad/empty compared to what it could have been[3] [4] (though I’d guess that you would similarly also find other isolated rollouts of your own reflection quite silly[5], and that different independent sufficiently long further rollouts from your current position would again find each other silly, and so on). However, note that like the galaxy-you that came out of this reflection, this world you’re examining has also gone through an [on most steps ex ante fairly legitimate] process of thoughtful development (by assumption, I guess), and the being(s) in that world now presumably think there’s a lot of extremely cool stuff happening in it. In fact, we could suppose that a galaxy-you is living in that world, and that they contributed to its development throughout its history, and that they now think that their world (or their corner of their world) is extremely based.[6]
Am I saying that the galaxy-you looking at this world from the outside is actually supposed to think it’s really cool, because it’s supposed to defer to the beings in that world, or because it’s supposed to think any development path consisting of ex ante reasonable-seeming steps is fine, or because some sort of relativism is right, or something? I think this isn’t right, and so I don’t want to say that — I think it’s probably fine for the galaxy-you to think stuff has really gone off the rails in that world. But I do want to say that when we ourselves are making this decision of which kind of future to have from our own embedded point of view, we should expect there to be a great deal of incomprehensible coolness in a human future (if things go right) — for instance, projects whose worth we wouldn’t see yet, but which we would come to correctly consider really profound in such a future (indeed, we would be tracking what’s worthwhile and coming up with new worthwhile things and doing those) — whereas we should expect there to instead be a great deal of incomprehensible valueless nonsense in an alien future.
If you’ve read the above and still think a galaxy-human future wouldn’t be based, let me try one more story on you. I think this looking-at-videos-of-a-distant-world framing of the question makes one think in terms of sth like assigning value to spacetime blocks “from the outside”, and this is a framing of ethical decisions which is imo tricky to handle well, and in particular can make one forget how much one cares about stuff. Like, I think it’s common to feel like your projects matter a lot while simultaneously feeling that [there being a universe in which there is a you-guy that is working on those projects] isn’t so profound; maybe you really want to have a family, but you’re confused about how much you want to make there be a spacetime block in which there is a such-and-such being with a family. This could even turn an ordinary ethical decision that you can handle just fine into something you’re struggling to make sense of — like, wait, what kind of guy needs to live in this spacetime block (and what relation do they need to have to me-now-answering-this-question); also, what does it even mean for a spacetime block to exist (what if we should say that all possible spacetime blocks exist?)? One could adopt the point of view that the spacetime block question is supposed to just be a rephrasing of the ordinary ethical question, and so one should have the same answer for it, and feel no more confused about what it means. One could probably spend some time thinking of one’s ordinary ethical decisions in terms of spacetime-block-making and perhaps then come to have one’s answers be reasonably coherent under having (arguably) the same decision problem presented in the ordinary way vs in some spacetime block way.[7] But I think this sort of thing is very far from being set up in almost any current human. So: you might feel like saying “whatever” way too much when ethical questions are framed in terms of spacetime-block-making, and the situation we’re considering could push one toward that frame; I want to alert you that maybe this is happening, maybe you really care more than it seems in that frame, and that maybe you should somehow imagine yourself being more embedded in this world when evaluating it.
- ↩︎
I guess one could imagine a future in which someone tiles the world with happy humans of the current year variety or something, but imo this is highly unlikely even conditional on the future being human-shaped, and also much worse than futures in which a wild variety of galaxy-human stuff is going on. Background context: imo we should probably be continuously growing more capable/intelligent ourselves for a very long time (and maybe forever), with the future being determined by us “from inside human life”, as opposed to ever making an artificial system that is more capable than humanity and fairly separate/distinct from humanity that would “design human affairs from the outside” (really, I think we shouldn’t be making [AIs more generally capable than individual humans] of any kind, except for ones that just are smarter versions of individual humans, for a long time (and maybe forever); see this for some of my thoughts on these topics).
- ↩︎
maybe we should pick a longer time here, to be comparing things which are more alike?
- ↩︎
I think this is probably true even if we condition the rollout on you coming to understand the world in the videos quite well.
- ↩︎
But if you disagree here, then I think I’ve already finished [the argument that the human far future is profoundly better] which I want to give to you, so you could stop reading here — the rest of this note just addresses a supposed complication you don’t believe exists.
- ↩︎
much like you could grow up from a kid into a mathematician or a philosopher or an engineer or a composer, thinking in each case that the other paths would have been much worse
- ↩︎
Unlike you growing up in isolation, that galaxy-you’s activities and judgment and growth path will be influenced by others; maybe it has even merged with others quite fully. But that’s probably how things should be, anyway — we probably should grow up together; our ordinary valuing is already done together to a significant extent (like, for almost all individuals, the process determining (say) the actions of that individual already importantly involves various other individuals, and not just in a way that can easily be seen as non-ethical).
- ↩︎
There might be some stuff that’s really difficult to make sense of here — it is imo plausible that the ethical cognition that a certain kind of all-seeing spacetime-block-chooser would need to have to make good choices is quite unlike any ethical cognition that exists (or maybe even could exist) in our universe. That said, we can imagine a more mundane spacetime-block-chooser, like a clone of you that gets to make a single life choice for you given ordinary information about the decision and that gets deleted after that; it is easier to imagine this clone having ethical cognition that leads to it making reasonably good decisions.
I won’t address why [AIs that humans create] might[1] have their own alien values (so I won’t address the “turning against us” part of your comment), but on these AIs outcompeting humans[2]:
There is immense demand for creating systems which do anything better than humans, because there is demand for all the economically useful things humans do — if someone were to create such a thing and be able to control it, they’d become obscenely rich (and probably come to control the world[3]).
Also, it’s possible to create systems that do anything better than humans. In fact, it’s probably not that hard — it’ll probably happen at some point in this century by default (absent an AGI ban).
While I’m probably much more of a lib than you guys (at least in ordinary human contexts), I also think that people in AI alignment circles mostly have really silly conceptions of human valuing and the historical development of values.[1] I touch on this a bit here. Also, if you haven’t encountered it already, you might be interested in Hegel’s work on this stuff — in particular, The Phenomenology of Spirit.
- ↩︎
This isn’t to say that people in other circles have better conceptions…
- ↩︎
It’s how science works: You focus on simple hypotheses and discard/reweight them according to Bayesian reasoning.
There are some ways in which solomonoff induction and science are analogous[1], but there are also many important ways in which they are disanalogous. Here are some ways in which they are disanalogous:
A scientific theory is much less like a program that prints (or predicts) an observation sequence than it is like a theory in the sense used in logic. Like, a scientific theory provides a system of talking which involves some sorts of things (eg massive objects) about which some questions can be asked (eg each object has a position and a mass, and between any pair of objects there is a gravitational force) with some relations between the answers to these questions (eg we have an axiom specifying how the gravitational force depends on the positions and masses, and an axiom specifying how the second derivative of the position relates to the force).[2] (I write this example out in symbols right after this list.)
Science is less in the business of predicting arbitrary observation sequences, and much more in the business of letting one [figure out]/understand/exploit very particular things — like, the physics someone knows is going to be of limited help when they try to predict the time sequence of intensities of a pixel on their laptop screen, but it is going to help them a lot when solving the kinds of problems that would show up in a physics textbook.
Even for solving problems that a theory is supposed to help one solve (and for the predictions it is supposed to help one make), a scientific theory is highly incomplete — in addition to the letter of the theory, a human solving the problems in a classical mechanics textbook will be majorly relying on tacit understanding gained from learning classical mechanics and their common-sense understanding.
Making scientific progress looks less like picking out a correct hypothesis from some set of pre-well-specified hypotheses by updating on data, and much more like coming up with a decent way to think about something where there previously wasn’t one. E.g. it could look like Faraday staring at metallic filings near a magnet and starting to talk about the lines he was seeing, or Lorentz, Poincaré, and Einstein making sense of the result of the Michelson-Morley experiment. Imo the bayesian conception basically completely fails to model gaining scientific understanding.
Scientific theories are often created to do something — I mean: to do something other than predicting some existing data — e.g., to make something; e.g., see https://en.wikipedia.org/wiki/History_of_thermodynamics.
Scientific progress also importantly involves inventing new things/phenomena to study. E.g., it would have been difficult to find things that Kirchhoff’s laws could help us with before we invented electric circuits; ditto for lens optics and lenses.
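(To write out the example from the first item above in symbols (this is just the standard textbook formulation of Newtonian gravity, nothing specific to this discussion): for point masses $m_1, \dots, m_N$ at positions $x_1(t), \dots, x_N(t)$, the theory says
$$F_{ij} \;=\; \frac{G\, m_i m_j\, (x_j - x_i)}{\lVert x_j - x_i \rVert^{3}}, \qquad m_i\, \ddot{x}_i \;=\; \sum_{j \neq i} F_{ij},$$
i.e. one axiom saying how the force between any pair of objects depends on their masses and positions, and one saying how the second derivative of each position relates to the forces. The theory fixes what sorts of things there are and what questions can be asked about them; it is not itself a printer of any particular observation sequence.)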
Idk, there is just very much to be said about the structure of science and scientific progress that doesn’t show up in the solomonoff picture (or maaaybe at best in some cases shows up inexplicitly inside the inductor). I’ll mention a few more things off the top of my head:
having multiple ways to think about something
creating new experimental devices/setups
methodological progress (e.g. inventing instrumental variable methods in econometrics)
mathematical progress (e.g. coming up with the notion of a derivative)
having a sense of which things are useful/interesting to understand
generally, a human scientific community doing science has a bunch of interesting structure; in particular, the human minds participating in it have a bunch of interesting structure; one in fact needs a bunch of interesting structure to do science well; in fact, more structure of various kinds is gained when making scientific progress; basically none of this is anywhere to be seen in solomonoff induction
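(For concreteness, by “solomonoff induction” I mean roughly the following standard setup (modulo technicalities about the machine model; this is the usual definition, not something taken from the quoted comment): fix a universal machine $U$, and for a binary observation sequence $x_{1:n}$ let
$$M(x_{1:n}) \;=\; \sum_{p \,:\, U(p)\text{ outputs a string beginning with } x_{1:n}} 2^{-\ell(p)}, \qquad M(x_{n+1} = b \mid x_{1:n}) \;=\; \frac{M(x_{1:n} b)}{M(x_{1:n})}.$$
So the “hypotheses” are (minimal) programs $p$, each with prior weight $2^{-\ell(p)}$ where $\ell(p)$ is its length, and prediction is just this fixed prior reweighted by consistency with the data seen so far; this is the picture which, per the list above, leaves out most of the structure of actual science.)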
- ↩︎
for example, that usually, a scientific theory could be used for making at least some fairly concrete predictions
- ↩︎
To be clear: I don’t intend this as a full description of the character of a scientific theory — e.g., I haven’t discussed how it gets related to something practical/concrete like action (or maybe (specifically) prediction). A scientific theory and a theory-in-the-sense-used-in-logic are ultimately also disanalogous in various ways — I’m only claiming it’s a better analogy than that between a scientific theory and a predictive model.
However, the reference class that includes the theory of computation is one possible reference class that might include the theory of agents.[1] But for all (I think) we know, the reference class we are in might also be (or look more like) complex systems studies, where you can prove a bunch of neat things, but there’s also a lot of behavior that is not computationally reducible and instead you need to observe, simulate, crunch the numbers. Moreover, noticing surprising real-world phenomena can serve as a guide to your attempts to explain the observed phenomena in ~mathematical terms (e.g., how West et al. explained (or re-derived) Kleiber’s law from the properties of intra-organismal resource supply networks[2]). I don’t know what the theory will look like; to me, its shape remains an open a posteriori question.
along an axis somewhat different than the main focus here, i think the right picture is: there is a rich field of thinking-studies. it’s like philosophy, math, or engineering. it includes eg Chomsky’s work on syntax, Turing’s work on computation, Gödel’s work on logic, Wittgenstein’s work on language, Darwin’s work on evolution, Hegel’s work on development, Pascal’s work on probability, and very many more past things and very many more still mostly hard-to-imagine future things. given this, i think asking about the character of a “theory of agents” would already soft-assume a wrong answer. i discuss this here
i guess a vibe i’m trying to communicate is: we already have thinking-studies in front of us, and so we can look at it and get a sense of what it’s like. of course, thinking-studies will develop in the future, but its development isn’t going to look like some sort of mysterious new final theory/science being created (though there will be methodological development (like for example the development of set-theoretic foundations in mathematics, or like the adoption of statistics in medical science), and many new crazy branches will be developed (of various characters), and we will surely resolve various particular questions in various ways (though various other questions call for infinite investigations))
Hmm, thanks for telling me, I hadn’t considered that. I think I didn’t notice this in part because I’ve been thinking of the red-black circle as being “canceled out”/”negated” on the flag, as opposed to being “asserted”. But this certainly wouldn’t be obvious to someone just seeing the flag.
I designed a pro-human(ity)/anti-(non-human-)AI flag:
The red-black circle is HAL’s eye; it represents the non-human in-all-ways-super-human AI(s) that the world’s various AI capability developers are trying to create, that will imo by default render all remotely human beings completely insignificant and cause humanity to completely lose control over what happens :(.
The white star covering HAL’s eye has rays at the angles of the limbs of Leonardo’s Vitruvian Man; it represents humans/humanity remaining more capable than non-human AI (by banning AGI development and by carefully self-improving).
The blue background represents our potential self-made ever-better future, involving global governance/cooperation/unity in the face of AI.
Feel free to suggest improvements to the flag. Here’s latex to generate it:
% written mostly by o3 and o4-mini-high, given k’s prompting
% an anti-AI flag. a HAL “eye” (?) is covered by a vitruvian man star
\documentclass[tikz]{standalone}
\usetikzlibrary{calc}
\usepackage{xcolor} % for \definecolor
\definecolor{UNBlue}{HTML}{5B92E5}
\begin{document}
\begin{tikzpicture}
%--------------------------------------------------------
% flag geometry
%--------------------------------------------------------
\def\flagW{6cm} % width → 2 : 3 aspect
\def\flagH{4cm} % height
\def\eyeR {1.3cm} % HAL-eye radius
% light-blue background
\fill[UNBlue] (0,0) rectangle (\flagW,\flagH);
%--------------------------------------------------------
% concentric “HAL eye” (outer-most ring first)
%--------------------------------------------------------
\begin{scope}[shift={(\flagW/2,\flagH/2)}] % centre of the flag
\foreach \f/\c in {%
1.00/black,
.68/{red!50!black},
.43/{red!80!orange},
.1/orange,
.05/yellow}%
{%
\fill[fill=\c,draw=none] circle ({\f*\eyeR});
}
%── parameters ───────────────────────────────────────
\def\R{\eyeR} % distance from centre to triangle’s tip
\def\Alpha{10} % full apex angle (°)
%── compute half-angle & half-base once ─────────────
\pgfmathsetmacro\halfA{\Alpha/2}
\pgfmathsetlengthmacro\halfside{\R*tan(\halfA)}
%── loop over Vitruvian‐man angles ───────────────────
\foreach \Beta in {0,30,90,150,180,240,265,275,300} {%
% apex on the eye‐rim
\coordinate (A) at (\Beta:\R);
% base corners offset ±90°
\coordinate (B) at (\Beta+90:\halfside);
\coordinate (C) at (\Beta-90:\halfside);
% fill the spike
\path[fill=white,draw=none] (A) -- (B) -- (C) -- cycle;
}
\end{scope}
\end{tikzpicture}
\end{document}
Conversely, there is some (potentially high) threshold of societal epistemics + coordination + institutional steering beyond which we can largely eliminate anthropogenic x-risk, potentially in perpetuity
Note that this is not a logical converse of your first statement. I realize that the word “conversely” can be used non-strictly and might in fact be used this way by you here, but I’m stating this just in case.
My guess is that “there is some (potentially high) threshold of societal epistemics + coordination + institutional steering beyond which we can largely eliminate anthropogenic x-risk in perpetuity” is false — my guess is that improving [societal epistemics + coordination + institutional steering] is an infinite endeavor; I discuss this a bit here. That said, I think it is plausible that there is a possible position from which we could reasonably be fairly confident that things will be going pretty well for a really long time — I just think that this would involve one continuing to develop one’s methods of [societal epistemics, coordination, institutional steering, etc.] as one proceeds.
I think at least this part is probably false!
Or really I think this is kind of a nonsensical statement when taken literally/pedantically, at least if we use the to-me-most-natural meaning of “predictor”, because I don’t think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
If you train a system purely to predict stuff, then even when we condition on it becoming really really good at predicting stuff, it probably won’t be scary. In particular, when you connect it to actuators, it probably doesn’t take over.
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, ie it isn’t going to only ultimately care about making a good prediction in the individual prediction problem you give it.[1]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let’s imagine a predictor more like the following:
It has a lot of internal tokens to decide what probability distribution it eventually outputs. Sometimes, on the way to making a prediction, it writes itself textbooks on various questions relevant to making that prediction. Maybe it is given access to a bunch of information about the world. Maybe it can see what predictions it made “previously” and it thinks about what went wrong when it made mistakes in similar cases in the past. Maybe it does lots of other kinds of complicated thinking. Maybe there are a bunch of capability ideas involved. Like, I’m imagining some setup where there are potentially many losses, but there’s still some outermost loss or fitness criterion or whatever that is purely about how good the system is at predicting some pre-recorded data.[2] And then maybe it doesn’t seem at all crazy for such a thing to eg be curious and like some aspects of prediction-investigations in a way that generalizes eg to wanting to do more of that stuff.
I’m not going to really justify claim 1 beyond this atm. It seems like a pretty standard claim in AI alignment (it’s very close to the claim that capable systems end up caring broadly about stuff by default), but I don’t actually know of a post or paper arguing for this that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (let’s imagine it being presented with hand control commands which it can now use in its internal chain of thought and grabbing the ball being important to it for some reason)? There’s nothing sooo particularly strange or complicated about doing real-world stuff. This is like how if we were in a simulation but there were a way to escape into the broader universe, with enough time, we could probably figure out how to do a bunch of stuff in the broader universe. We are sufficiently good at learning that we can also get a handle on things in that weird case.
Combining claims 1 and 2 should give that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn’t go well for us but goes well for the AI (if it’s smart enough).
That said, I think it’s likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn’t be outputting any interesting scientific papers that aren’t in the training data.)
If we want to be more concrete: if we’re imagining that the system is only able to affect the world through outputs which are supposed to be predictions, then my claim is that if you set up a context such that it would be “predictively right” to assign a high probability to “0” but assigning a high probability to “1” lets it immediately take over the world, and this is somehow made very clear by other stuff seen in context, then it would probably output “1”.
Actually, I think “prediction problem” and “predictive loss” are kinda strange concepts, because one can turn very many things into predicting data from some certain data-generating process. E.g. one can ask about what arbitrary turing machines (which halt) will output, so about provability/disprovability of arbitrary decidable mathematical statements.