i blog at carado.moe.
alignment research here. current alignment plan: QACI.
i’m also on twitter.
the counterfactuals might be defined wrong but they won’t be “under-defined”. but yes, they might locate the blob somewhere we don’t intend to (or insert the counterfactual question in a way we don’t intend to); i’ve been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you’re talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i’m hopeful it gets there eventually, and maybe there are ways to check that it’ll make good enough guesses even before we let it loose.
(cross-posted as a top-level post on my blog)
QACI and plausibly PreDCA rely on a true name of phenomena in the real world using solomonoff induction, and thus talk about locating them in a theoretical giant computation of the universe, from the beginning. it’s reasonable to be concerned that there isn’t enough compute for an aligned AI to actually do this. however, i have two responses:
isn’t there enough compute? supposedly, our past lightcone is a lot smaller than our future lightcone, and quantum computers seem to work. this is evidence that we can, at least in theory, build within our future lightcone a quantum computer simulating our past lightcone. the major hurdle here would be “finding out” a fully explanatory “initial seed” of the universe, which could take exponential time, but maybe doesn’t have to.
we don’t need to simulate the past lightcone. if you ask me what my neighbor was thinking yesterday at noon, the answer is that i don’t know! the world might be way too complex to figure that out without simulating it and scanning his brain. however, i have a reasonable distribution over guesses: he was more likely to think about french things than korean things, more likely to think about his family than my family, et cetera. an aligned superintelligence can hold an ever more refined distribution of guesses, and then maximize the expected utility of the utility functions corresponding to each guess.
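the “distribution over guesses” idea can be sketched in a few lines; everything here (the candidate utility functions, the probabilities, the actions) is invented for illustration:

```python
# toy sketch: instead of pinning down one utility function, hold a
# distribution over candidate utility functions and pick the action
# maximizing expected utility under that whole distribution.

def expected_utility(action, guesses):
    # guesses: list of (probability, utility_function) pairs
    return sum(p * u(action) for p, u in guesses)

def best_action(actions, guesses):
    return max(actions, key=lambda a: expected_utility(a, guesses))

# two made-up guesses about what the user values
guesses = [
    (0.7, lambda a: {"help": 1.0, "paperclips": 0.0}[a]),
    (0.3, lambda a: {"help": 0.8, "paperclips": 0.1}[a]),
]
print(best_action(["help", "paperclips"], guesses))  # -> help
```

the point is just that no single guess needs to be exactly right: the action is chosen against the whole weighted mixture.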
sounds maybe kinda like a utopia design i’ve previously come up with, where you get your private computational garden and all interactions are voluntary.
that said, some values do need to reach into people’s gardens: you can’t create arbitrarily suffering moral patients, you might have to be stopped in some way from partaking in some molochianisms, etc.
i don’t think determinism is incompatible with making decisions, just like nondeterminism doesn’t mean my decisions are “up to randomness”; from my perspective, i can either choose to do action A or action B, and from my perspective i actually get to steer the world towards what those action lead to.
put another way, i’m a compatibilist; i implement embedded agency.
put another way, yes i LARP, and this is a world that gets steered towards the values of agents who LARP, so yay.
what i mean here is “with regards to how much moral-patienthood we attribute to things in it (eg for if they’re suffering), rather than secondary stuff we might care about like how much diversity we gain from those worlds”.
(this answer is cross-posted on my blog)
here is a list of problems which i seek to either resolve or get around, in order to implement my formal alignment plans, especially QACI:
formal inner alignment: in the formal alignment paradigm, “inner alignment” refers to the problem of building an AI which, when run, actually maximizes the formal goal we give it (in tractable time) rather than doing something else such as getting hijacked by an unaligned internal component of itself. because its goal is formal and fully general, it feels like building something that maximizes it should be much easier than the regular kind of inner alignment, and we could have a lot more confidence in the resulting system. (progress on this problem could be capability-exfohazardous, however!)
continuous alignment: given a utility function which is theoretically eventually aligned (there exists a level of capabilities above which it reliably has good outcomes), how do we bridge the gap from where we are to that level? will a system “accidentally” destroy all values before realizing it shouldn’t have done that?
blob location: for QACI, how do we robustly locate pieces of data stored on computers encoded on top of bottom-level-physics turing-machine solomonoff hypotheses for the world? see 1, 2, 3 for details.
observation data precision and computation substrate: related to the previous problem, how precisely does the prior we’re using need to capture our world, for the intended instance of the blobs to be locatable? can we just find the blobs in the universal program — or, if P≠BQP, some universal quantum program? do we need to demand worlds to contain, say, a dump of wikipedia to count as ours? can we use the location of such a dump as a prior for the location of the blobs?
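a toy sketch of what “locating a blob” in a prior over world-hypotheses could look like. the hypotheses here are just byte strings tagged with a program length, standing in for the outputs of solomonoff hypotheses, and the 2^-length weighting is only in the spirit of such a prior; all names and data are illustrative:

```python
# toy sketch: weight each world-hypothesis containing the blob by
# 2^-(program length), and return a distribution over where in the
# hypothesis's output the blob sits.

def locate_blob(blob, hypotheses):
    # hypotheses: list of (program_length, output_bytes)
    dist = {}
    total = 0.0
    for length, out in hypotheses:
        if blob in out:
            w = 2.0 ** -length
            pos = out.index(blob)  # first occurrence, for simplicity
            dist[pos] = dist.get(pos, 0.0) + w
            total += w
    return {pos: w / total for pos, w in dist.items()}

hypotheses = [(10, b"xxQACIyy"), (12, b"noise"), (11, b"..QACI..")]
print(locate_blob(b"QACI", hypotheses))  # -> {2: 1.0}
```

the real problem is of course vastly harder (the blob is encoded on top of bottom-level physics, not sitting verbatim in a byte string), but this is the shape of the object being asked for: a posterior over locations, not a single pointer.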
infrastructure design: what formal-math language will the formal goal be expressed in? what kind of properties should it have? should it include some kind of proving system, and in what logic? in QACI, will this also be the language for the user’s answer? what kind of checksums should accompany the question and answer blobs? these questions are at this stage premature, but they will need some figuring out at some point if formal alignment is, as i currently believe, the way to go.
i would love a world-saving-plan that isn’t “a clever scheme” with “many moving parts” but alas i don’t expect it’s what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i’ve heard of.
Only a single training example needed through use of hypotheticals.
(to be clear, the question and answer serve less as “training data” meant to represent the user, and more as “IDs” or “coordinates” meant to locate the user in the past-lightcone.)
We need good inner alignment. (And with this, we also need to understand hypotheticals).
this is true, though i think we might not need a super complex framework for hypotheticals. i have some simple math ideas that i explore a bit here, and about which i might write a bunch more.
for failure modes like the user getting hit by a truck or spilling coffee, we can do things such as, at each step, asking not 1 cindy the question but asking 1000 cindys 1000 slight variations on the question, and then maybe have some kind of convolutional network curate their answers (such as ignoring garbled or missing output) and pass them to the next step, without ever relying on a small number of cindys except at the very start of this process.
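a minimal sketch of the filter-and-curate step, with a simple validity filter and majority vote standing in for the convolutional-network curation; all the data here is invented:

```python
# toy sketch: collect many answers from slightly varied copies of the
# question, drop garbled or missing ones, and keep the majority answer.
from collections import Counter

def curate(answers):
    # filter out missing, empty, or garbled (non-printable) answers
    valid = [a for a in answers
             if a is not None and a.isprintable() and a.strip()]
    if not valid:
        return None
    # majority vote over the surviving answers
    return Counter(valid).most_common(1)[0][0]

answers = ["grow flowers"] * 6 + ["gr\x00bled", None, "", "make paperclips"]
print(curate(answers))  # -> grow flowers
```

a real curation step would need to handle answers that disagree in substance rather than just in formatting, which is where the hard part lives.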
it is true that weird memes could take over the graph of cindys; i don’t have an answer to that, apart from that it seems sufficiently unlikely to me that i still think this plan has promise.
Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it’s in a simulation, hacks into the answer channel and returns “make as many paperclips as possible” to the AI.
hmm. that’s possible. i guess i have to hope this never happens on the question-interval, on any simulation day. alternatively, maybe the mutually-checking graph of 1000 cindys can help with this? (but probly not; clippy can just hack the cindys).
So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.
yup. or, if the QACI user is me, i’m probly also just fine with those local deaths; not a big deal compared to an increased chance of saving the world. alternatively, instead of being saved on disk, they can also just be recomputed later since the whole process is deterministic.
I am not confident that your “make it complicated and personal data” approach at the root really stops all the aliens doing weird acausal stuff.
yup, i’m not confident either. i think there could be other schemes, possibly involving cryptography in some ways, to entangle the answer with a unique randomly generated signature key or something like that.
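one shape such a scheme could take, sketched with an HMAC; this is only an illustration of “entangle the answer with a unique randomly generated key”, not a worked-out proposal, and all the names are made up:

```python
# toy sketch: bind the answer to a randomly generated secret with an
# HMAC tag, so a blob missing a valid tag (e.g. one forged by a
# simulated adversary who never saw the secret) fails verification.
import hmac, hashlib, secrets

secret = secrets.token_bytes(32)  # generated once, never put in the prior

def tag(answer: bytes) -> bytes:
    return hmac.new(secret, answer, hashlib.sha256).digest()

def verify(answer: bytes, t: bytes) -> bool:
    # constant-time comparison, as is standard for MACs
    return hmac.compare_digest(tag(answer), t)

a = b"the user's considered answer"
t = tag(a)
print(verify(a, t))                    # -> True
print(verify(b"make paperclips", t))   # -> False
```

the open question is how such a secret interacts with the blob-location math, since the whole point is that the located answer must be the one entangled with the key.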
i mean, sure, but i’d describe both as utility maximizers because maximizing utility is in fact what they consistently do. Dragon God’s claim seems to be that we wouldn’t get an AI that would be particularly well predicted by utility maximization, and this seems straightforwardly false of agents 1 and 2.
(this response is cross-posted as a top-level post on my blog)
i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents (SGCA) which maximize expected utility. note that this is two separate claims; it could be that i believe the AI that kills us isn’t SGCA, but the AI that saves us still has to be SGCA. i could see shifting to that latter viewpoint; i currently do not expect myself to shift to believing that the AI that saves us isn’t SGCA.
to me, this totally makes sense in theory, to imagine something that just formulates plans-over-time and picks the argmax for some goal. the whole of instrumental convergence is coherent with that: if you give an agent a bunch of information about the world, and the ability to run eg linux commands, there is in fact an action that maximizes the amount of expected paperclips in the universe, and that action does typically entail recursively self-improving and taking over the world and (at least incidentally) destroying everything we value. the question is whether we will build such a thing any time soon.
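the “formulate plans-over-time and pick the argmax” picture, reduced to its bare skeleton; the plans and their scores are of course invented:

```python
# toy sketch: enumerate candidate plans, evaluate each under the goal,
# take the best. everything interesting is hidden inside the scorer.

def argmax_plan(plans, score):
    return max(plans, key=score)

plans = ["do nothing", "self-improve then act", "act directly"]
expected_paperclips = {"do nothing": 0.0,
                       "self-improve then act": 9e9,
                       "act directly": 1e3}
print(argmax_plan(plans, expected_paperclips.get))  # -> self-improve then act
```

the instrumental-convergence point is visible even in this caricature: the plan that routes through gaining more capability dominates the direct one.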
right now, we have some specialized agentic AIs: alphazero is pretty good at reliably winning at go; it doesn’t “get distracted” with other stuff. to me, waiting for SGCA to happen is like waiting for a rocket to get to space in the rocket alignment problem: once the rocket is in space it’s already too late. the whole point is that we have to figure this out before the first rocket gets to space, because we only get to shoot one rocket to space. one has to build an actual inside view understanding of agenticity, and figure out if we’ll be able to build that or not. and, if we are, then we need to solve alignment before the first such thing is built — you can’t just go “aha, i now see that SGCA can happen, so i’ll align it!” because by then you’re dead, or at least past its decisive strategic advantage.
i’m not sure how to convey my own inside view of why i think SGCA can happen, in part because it’s capability exfohazardous. maybe one can learn from IEM or the late 2021 MIRI conversations? i don’t know where i’d send someone to figure this out, because i think i largely derived it from the empty string myself. it does strongly seem to me that, while a single particular neural net might not be the first thing to be an SGCA, we can totally bootstrap SGCA from existing ML technology; it might just take a clever trick or two rather than being the completely direct solution of “oh you train it like this and then it becomes SGCA”. recursive self-improvement is typically involved.
we also have some AIs, including sydney, which aren’t SGCA. it might even be that SGCA is indeed somewhat unnatural for a lot of current deep learning capabilities. nevertheless, i believe such a thing is likely enough to be built that it’s what i expect to kill us: maybe non-SGCA AI’s impact on the economy would slowly disempower us over the course of 20~40 years, but in those worlds AI tech gets good enough that 5 years into it someone figures out the right clever trick to build (something that bootstraps to) SGCA, and we die of agentic intelligence explosion very fast, before we get to see the slow economic disempowerment. in addition, i believe that our best shot is to build an aligned SGCA.
why haven’t animals or humans gotten to SGCA? well, what would getting from messy biological intelligences to SGCA look like? typically, it would look like one species taking over its environment while developing culture and industrial civilization, overcoming in various ways the cognitive biases that happened to be optimal in its ancestral environment, and eventually building more reliable hardware such as computers and using those to make AI capable of much more coherent and unbiased agenticity.
that’s us. this is what it looks like to be the first species to get to SGCA. most animals are strongly optimized for their local environment, and don’t have the capabilities to be above the civilization-building criticality threshold that lets them build industrial civilization and then SGCA AI. we are the first one to get past that threshold; we’re the first one to fall in an evolutionary niche that lets us do that. this is what it looks like to be the biological bootstrap part of the ongoing intelligence explosion; if dogs could do that, then we’d simply observe being dogs in the industrialized dog civilization, trying to solve the problem of aligning AI to our civilized-dog values.
we’re not quite SGCA ourselves because, turns out, the shortest path from ancestral-environment-optimized life to SGCA is to build a successor that is much closer to SGCA. if that successor is still not quite SGCA enough, then its own successor will probly be. this is what we’re about to do, probly this decade, in industrial civilization. maybe if building computers was much harder, and brains were more reliable to the point that rational thinking was not a weird niche thing you have to work on, and we got an extra million years or two to evolutionarily adapt to industrialized society, then we’d become properly SGCA. it does not surprise me that that is not, in fact, the shortest path to SGCA.
yeah, totally, i’m also just using that post as a jump-off point for a more in-depth long-form discussion about dragon god’s beliefs.
there’s a lot to unpack here. i feel like i disagree with a lot of this post, but it depends on the definitions of terms, which in turn depends on what those questions’ answers are supposed to be used for.
what do you mean by “optimality across all domains” and why do you care about that?
what do you mean by “efficiency in all domains wrt human civilization” and why do you care about that?
there are also statements that i straight-up disagree with. for example,
To put this in a succinct form, I think a superintelligence can’t beat SOTA dedicated chess AIs running on a comparable amount of compute.
that feels easily wrong. 2026 chess SOTA probly beats 2023 chess SOTA. so why can’t a superintelligent AI just invent in 2023 what would’ve taken us 3 years to invent, get to 2026 chess SOTA, and use that to beat our SOTA? it’s not like we’re anywhere near optimal or even remotely good at designing software, let alone AI. sure, this superintelligence spends some compute coming up with its own better-than-SOTA chess-specialized algorithm, but that investment could be quickly recouped. whether it can be recouped within a single game of chess is up to various constant factors.
a superintelligence beats existing specialized systems because it can not only turn itself into what they do, but turn itself into something better than what they do, because it also has the capability “design better AI”. this feels sufficient for a superintelligence to beat any specialized system that doesn’t have a general-improvement part (and if one does, it probly fooms to superintelligence pretty easily itself). but note that this might not even be necessary for superintelligence to beat existing specialized systems: it could be that it improves itself in a general enough way that, on arrival, it is better than most existing specialized systems.
this is all because existing specialized systems are very far from optimal. that’s the whole point of 2026 chess SOTA beating 2023 chess SOTA — 2023 chess SOTA isn’t optimal, so there’s room to find better, and superintelligence can simply make itself be a finder-of-better-things.
Thus, I expect parts of human civilisation/economic infrastructure to retain comparative advantage (and plausibly even absolute advantage) on some tasks of economic importance wrt any strongly superhuman general intelligences due to the constraints of pareto optimality.
okay, even if this were true, it doesn’t particularly matter, right? like, if AI is worse than us at a bunch of tasks, but it’s good enough to take over enough of the internet to achieve decisive strategic advantage and then kill us, then that doesn’t really matter a lot.
so sure, the AI never learned to drive better than our truckers, and our truckers never technically lost their jobs to competition, but also everyone everywhere is dead forever.
but i guess this relies on various arguments about the brittleness of civilization to unaligned AI.
I think there would be gains from trade between civilisation and agentic superintelligences. I find the assumptions that a superintelligence would be as far above civilisation as civilisation is above an ant hill nonsensical.
why is that? even if both of your claims are true, that general optimality is impossible and general efficiency is infeasible, this does not stop an AI from specializing at taking over the world, which is much easier than outcompeting every industry (you never have to beat truckers at driving to take over the world!). and then, it doesn’t take much inside view to see how an AI could actually do this without a huge amount of general intelligence; yudkowsky’s usual scheme for AI achieving DSA and us all falling dead within the same second, as explained in the podcast he was recently on, is one possible inside-view way for this to happen.
we’re “universal”, maybe, but we’re the very first thing that got to taking over the world. there’s no reason to think that the very first thing to take over the world is also the thing that’s the best at taking over the world; and surprise, here’s one that can probly beat us at that.
and that’s all excluding dumb ways to die such as for example someone at a protein factory just plugging an AI into the protein design machine to see what funny designs it’ll come up with and accidentally kill everyone with neither user nor AI having particularly intended to do this (the AI is just outputting “interesting” proteins).
i’ve made some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i’m making progress in that direction (see the posts about blob location).
are you saying this in the prescriptive sense, i.e. that we should want that property? i think if implemented correctly, accuracy is all we would really need, right? carrying human intent into those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, whereas straightforward utility maximization should work.