kaarelh AT gmail DOT com
Kaarel
some more variations on this theme:
Nick Land in Meltdown: “Nothing human makes it out of the near-future.”, “Capital only retains anthropological characteristics as a symptom of underdevelopment; reformatting primate behaviour as inertia to be dissipated in self-reinforcing artificiality. Man is something for it to overcome: a problem, drag.”
Historical materialism views the organization of society throughout history as being the argmax of production (or maybe the argmax of the development of production or productive power, or something like that), and after AGI, humans will not be part of the argmax of production for long.
“when you make something less useful (eg by introducing other things that can do its “jobs/functions” better), you make it less likely to stick around”, “what is no longer good for anything tends to get discarded” [1]
“messy futures are bad for humans” (in the limit: “a uniformly random configuration of atoms doesn’t have anything like humans in it”)
- ↩︎
conversely, you can make something more likely to be preserved by figuring out how to make it instrumental to more valued/productive/competitive things/processes — each such process then provides a reason to keep the thing around, and provides a constraint on any replacement to the thing. “instrumentalizing the terminal”, ie protecting good things this way, is a sort of dual to subgoal stomp. i think protection by instrumentality is the main way one gets conserved structures in biological evolution
maybe even more generally, there is a “game of questions/problems and answers/solutions” played by humans and human communities, that one can study to become better able to create a setup in which AIs are playing this game. some questions about this game: “how does an individual human or a human community remain truth-tracking?”, “what structures can do load-bearing work in a truth-tracking system?”, “to involve a new mind in a community of truth/knowledge/understanding, what is required of the new mind and what is required of its teachers/environment?”, “what interventions make a system more truth-tracking?”, “how does one avoid meaning drift/subversion?”. this includes the science stuff you talk about but also very basic stuff like a kid learning arithmetic from their parents or humans working successfully with integrals for two centuries before we could define them rigorously — like, how come we can mostly avoid goodharting answers against the judgment of other people, how come we can mostly avoid becoming predictors of what other people would say, how come we can do easy-to-hard generalization of notions, etc.. the usual losses/setups currently used by ML practitioners might be sorta wrong for these things, and maybe one could think carefully about the human case and come up with better losses/setups to use in an epistemic system. an obstacle is that in the human case, stuff working well is probably meaningfully aided by the agents already having shared human purposes [1] [2] and by already having similar “priors” coming from the human brain architecture and similar upbringings. another obstacle is that the human thing is probably relying on various low-level things that are hard to see and that probably lack equivalents in current ML systems and are too low-level to be created by any simple intervention on a community of LLMs. 
another obstacle is that there are probably just very many ideas involved in making humans truth-tracking (though you can then ask: how do we set up a meta-level thing that finds and implements good ideas for how an epistemic system should work). another obstacle is that in the human case, human purposes are broadly aligned with understanding stuff better in the systems of understanding we have (whereas if we force some system of presenting understanding on the LLMs and try to get them to produce some understanding and present it legibly in that system, their purposes are probably not well-aligned by default with doing that). (oh also, if your work results in understanding these questions well, you should worry about your work helping with capabilities. maybe don’t give capabilities researchers good answers to “how do we make it so the originators of good ideas get rewarded in an epistemic community?”, “how does one tell when a new notion is good to introduce into the shared lexicon?”, “what is the process of coming up with a good new notion like?”, “what sort of thing is a good model of a situation?”, “how does one avoid assigning a lot of resources to useless cancers like algebraic number theory?” [3] .) anyway, despite these issues, it still seems like an interesting direction to work on
copying a note i wrote for myself on a related question:
″
beating solomonoff induction at grokking a notion
how come as humans we can understand what someone means when using a word. as opposed to becoming a predictor of what they would say. it is possible for a human to not make the mistakes another person would make when eg classifying images for having dogs vs not! roughly speaking solomonoff would be making the same mistakes the person would make
this is a classic issue plaguing many (maybe even most?) things in alignment. eg ELK, AGI via predictive modeling, CIRL/RLHF or just pretty much anything involving human feedback
can’t we write an algo for that, and have that not be dumb like solomonoff is dumb
some ideas for ways to implement a thing that is good like this / what’s going on in making the human thing work:
an even stronger simplicity prior than solomonoff. eg if there are explainable mistakes on a simple model, you want the simple model that doesn’t predict the mistakes. this will have inf log loss but let’s just do a version of the simple hypothesis with noise, and then penalize the likelihood term less. have people not already considered this for solving the model + data split problem? does this attempt to solve the model data split problem introduce some pathologies?
you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss). but i think at least this pathology is not present in the function case, if we don’t get randomness in the universal semimeasure way (like if we make the randomness not shared between different inputs — each input has to sample its own random bits)
alternatively: just set abs bound on model complexity, rest has to be likelihood. this feels bad because if you get the bound wrong you get some nonsense. that said in a sense this is equivalent to the previous proposal (like if you pick the length bound the previous thing with some hyperparam would find). idk maybe in the function case you can look at how many bits of entropy are left given the hypothesis, like imagine this graphed as a function of hypothesis length, and like see some point at which the derivative changes or sth. (this doesn’t show up in the seq case because there it’s pretty much just 1 bit paying for 1 bit (until you specify it in full if it’s finite complexity))
simplicity prior defined in terms of existing understanding
you specify properties of the thing or notion sometimes
eg [concrete] and [abstract] make a partition of things maybe, but [alice would think this is concrete] and [alice would think this is abstract] might not. eg knowing [if something is abstract, then it usually helps a lot to study examples to understand it] can help you understand when your teacher alice is making a mistake about an abstractness claim
or eg: 1+1=2 won’t be true if you accidentally assign 1->rabbit and 2->chicken from a demonstration (for any reasonable meaning of plus)
some sort of t complexity bound might help. tho really you aren’t gaining a mechanism when you learn what a dog is. you are more like learning a new question/problem
also as a human one can just ask: what is it that this person is trying to teach me. what is this person trying to point at. this is a question you can approach like any other question
when we gain a notion, we gain sth like a question that can be asked about a thing. and we have criteria on this notion. we gain “inference rules”/”axioms” involving the notion. ultimately we are wanting it to play some role in our thought and action. that role can guide the precisification/development/reworking of the concept. the role can be communicated. it can be shared between minds
to gain the chair notion is to gain the question “is this a chair?”. this has an immediate verifier (mostly visual), but also further questions: “can i sit on it?”, “is it comfortable to sit on it?”, “would i use it when working or dining?”, “does it have a back support part and a butt support part and legs?”. a chair should support the activities of sitting and working and dining. all these can have their own immediate verifiers and further questions
we understand “is this a chair?” as clearly separate from “would the person who taught me the chair notion consider it a chair?”. it is much closer to “should the person who taught me the chair notion consider it a chair?”. it is also close to “should i consider it a chair?”
important basic point here: our dog thing is NOT a classifier. classifiers or noticing trick circuits can be attached to our dog structure but the structure is not a classifier
toy problem here: how do you pin down the notion of a proof? (how did we historically?) how do you pin down the notion of an integral? (how did we historically?) maybe study these actual examples
pinning down the notion of a proof might be a good example to study in detail. like, how does one become able to tell whether something is a good proof? a valid reasoning step? how does one start to reason validly? one reason to be interested in this is that it’s analogous to: how does one become able to tell what’s good, and come to act well? both are examples of getting some sort of normativity into a system
another example: we have a notion of truth, not just some practical thing like provability (or in a broader context supporting action well maybe). our notion of truth is separate from our notion of provability eg because we have the “axiom/principle” when talking about truth that exactly one of a sentence and its negation is true, or alternatively/equivalently we have an inference rule of going from “P is not true” to “not-P is true”, and such a rule is just not right for provability (there are sentences such that the sentence and its negation are both not provable). by gödel’s completeness theorem, i guess a fine notion of truth, ie one which has a model, is precisely one which assigns 0/1 to all sentences and is coherent under proving. we operate with truth by relying on these properties, without having a decision algorithm or even a definition for truth (cf tarski’s thm).
how did we understand what an integral is?
i think we were using integrals for like two centuries before we knew how to properly define them (eg via riemann sums). how come we were pretty successful with that? like, how come we did all this cool stuff, we came to all these correct conclusions, without properly knowing what integrals are? i think the general thing that happened is that we hypothesized an object with some properties and these properties turned out to be those of a real thing, and in fact to pin it down uniquely! though of course this leaves the following important question: how did we identify this set of properties as important?
″
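The “even stronger simplicity prior” proposal from the note above — penalize the likelihood term less than model complexity, so a simple hypothesis that treats a teacher’s explainable mistakes as noise can beat one that memorizes them — can be sketched as a two-part-code score. All numbers (the bit counts, the beta value) and the dog-labeling framing are invented for illustration:

```python
def score(model_bits, log_loss_bits, beta=0.5):
    # two-part-code style score: description length of the hypothesis plus a
    # likelihood (data-fit) term; beta < 1 penalizes data-fit less than model
    # complexity, unlike standard two-part MDL where beta = 1
    return model_bits + beta * log_loss_bits

# hypothetical hypotheses for learning "dog" from a teacher's labels:
# "clean" = the simple dog concept, treating the teacher's mislabels as noise
clean = score(model_bits=100, log_loss_bits=50)  # 100 + 0.5 * 50 = 125.0
# "mimic" = the same concept plus a memorized list of the teacher's mistakes
mimic = score(model_bits=130, log_loss_bits=0)   # 130 + 0.5 * 0  = 130.0

# with beta = 0.5 the clean concept wins, ie we prefer the hypothesis that
# does NOT reproduce the teacher's explainable mistakes
assert clean < mimic
```

With standard MDL weighting (beta = 1), the mistake-memorizing hypothesis would win here (150 vs 130 bits); down-weighting the likelihood term flips the preference toward the simple concept that doesn’t predict the teacher’s mistakes.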
But continental Europe historically and China today offer some counter evidence, as they’re technologically competitive without having a comparably competent philosophical tradition.
continental europe historically seems like a clear example of high technological competence together with high philosophical competence (both measured relative to the time)
today’s US has much higher incarceration rate than today’s China
i’d guess that the incarceration rate among chinese americans is at most roughly as large as the incarceration rate in china though. [1] controlling for the two countries having different people seems important if we’re trying to assess the repressiveness of each country’s governing system. (that said: chinese americans are also richer than chinese chinese, and one would want to control for that as well, introducing a correction in the other direction)
(that said: my overall position is that it is very bad for the US to race with china)
assorted thoughts in response:
I definitely want people to think more about what AIs would think and do over a lot of reflection/development, and when more powerful. People should think more about the effects of a mind. People should think of the AGI situation as us probably having to correctly determine the future via an extremely long causal chain. [1]
I don’t think it’s weird to speak of values the way I’m speaking of values. I think people accept this sort of value-talk in other contexts. E.g. it’s common for antirealists to think of ethical truths as being determined by some ideal reflection; e.g. the notion of CEV. I think people who in some contexts use “egregious misalignment” in this “egregious misbehavior in mundane situations” sense also sometimes make inferences as if they were using “misalignment” in the sense I suggest. That said, one could want to make a distinction between reflection and development-in-general, and certainly it makes sense to distinguish between more and less endorsed forms of development. I think I was somewhat sloppy with this in my first comment.
I think it’d in principle be fine for some ideal beings to use words however. In practice, [people are stupid]/[thinking is difficult], and it’s very natural to make the inference [“the AI is egregiously misaligned” → “the AI wants to egregiously misbehave in normal circumstances”] and also to make the inference [“the AI endorses each step of a process which leads to all humans dying” → “it was egregiously misaligned”], but I think there isn’t a concept that supports both of these inferences at once (or at least I think our language should leave this as an open question). So, I mostly don’t endorse using “catastrophic/egregious/large misalignment”, and instead endorse trying to say what one means in other words. I should maybe have used different words in my first comment as well. I don’t have good alternative terms to suggest atm, except saying what one means with more words. I guess I’d want more people to try spending some time thinking about the AI situation while tabooing a bunch of Constellation-speak and MIRI-speak, building up their own Entish.
- ↩︎
Some people think they can avoid this difficulty by having a first mess-AI “solve alignment” and launch some sort of aligned ASI sovereign, with the first AI not being that weird. I think that to first order one should think of this as the original AI trying to determine the future via a bottleneck. And in real life, people would plausibly just let the AI self-improve with some monitoring lol, in which case it’s not exactly a tight bottleneck. The original AI will also already be doing a lot of reflection and development. Also, there will be a long chain of causality after the ASI sovereign that needs to go right. (Also, in practice, instead of some clever scheme with boxed AIs solving alignment, we will probably just get some total mess with AIs deployed broadly, connected to the internet, plausibly just running AI labs. And there’s the AIs breaking out, and there’s fooming being fast, and there’s not having much time to be careful.)
I think that if we try to make sense of “what a current AI would do after reflecting+developing for a long time”, that thing does not involve being nice to humans. I think it’s still not nice to humans if we add the constraint “and the reflection/development process has to be basically [endorsed by the AI]/[good according to the AI]”. I think it’s pretty standard to take what you would do [after a lot of reflection + if you were more powerful] to reflect your values better than what you would do instinctively. So, if I’m right about what would happen given further (self-endorsed) development, it seems like a standard use of language (at least in alignment and in philosophy) + true to say current AIs are bad? I’d agree it is also pretty standard + [maybe true] to say “current AIs are good” in the sense that they mostly have pretty acceptable instinctive behaviors. This situation is pretty unfortunate, and maybe calls on us to start explicitly making this distinction. [1]
- ↩︎
“Catastrophic misalignment” is a bad term, in addition to the reason I already gave in my comment, also because it could mean that this AI in fact would cause a catastrophe (without human help), which I don’t think is true for current AIs. That said, I think that’s prevented by capabilities, not by alignment — I think the closest thing to a current AI which is capable of causing a catastrophe would cause a catastrophe. I guess maybe one should say “misalignment sufficient for a catastrophic outcome if choosing the future were handed to the AI”.
- ↩︎
I think current AI systems are likely catastrophically misaligned, but instead of properly arguing for it here, I want to clear the much lower bar of making the position sound much less weird than it might at first. When I imagine a person to whom this position sounds weird, I imagine them saying sth like:
“AIs are acting nicely in various contexts. They look nice in our evaluations, and they look nice to users in practice. Isn’t it unlikely that they are really evil, hiding it, waiting to strike?”
While I think it’s likely that current AI systems are catastrophically misaligned, I don’t much feel like taking a position one way or the other about the “are really evil, hiding it, waiting to strike” part. I think the hypothetical interlocutor above is making a false equivalence. When I say current AI systems are “catastrophically misaligned”, what I have in mind is this:
When someone sets up some initial AI system and lets it develop a lot (ie lets it do RSI) [1] , with anything like this that could be done in practice [2] , this doesn’t go well for humans. I think that the default without strict regulation of AI development is: in the first 10 years after AGI (by which I mean AI that autonomously does conceptual research better than top humans), there will be a lot of development — like probably more development than there has been in total in all of history. [3] Like, after developing for a lot of “subjective time”, the AI systems that come out of this development process would trivially be able to replace humans with whatever other processes from some vast number of options; the negentropy/[free energy]/atoms I’m currently using could probably be used to run processes of similar complexity/interestingness. Despite it being trivial for the AI to do this, the AI needs to not do this (or, maybe disassemble me, but at least recreate me on a computer, I guess...). In fact, the AI doesn’t just need to leave me alone, it needs to protect me from being killed by any other beings, and make sure I have a bunch of resources so I can live a long life. It’s kinda like I need to be very close to the coolest possible process to this AI, despite being “objectively” extremely boring, slow, wasteful, with “objectively” nothing to offer to the AI. This seems like a really sharp property; it feels like a measure 0 sort of thing. Preserving this forever feels especially sharp. I think it’s unlikely that this property would be upheld. I don’t think it is that reassuring if this long development process is started by AIs whose cached policies for mundane situations are pretty nice(-looking). [4]
Maybe this at least makes it seem not weird to think that current AI systems are catastrophically misaligned. It’s plausible we’re just using the same words differently, but in that case I think my use better tracks the niceness-type property that really matters. Like, it ultimately matters whether our AIs will continue to protect us forever when everything is up to them, not whether they behave nicely in mundane interactions now. I guess the terms “catastrophic/egregious misalignment” or “a large amount of misalignment” are quite unfortunate because it’s sort of unclear if one should read them as [misalignment sufficient for things to end up being really bad] (in that case, given doomy views, even an extremely small failure to set valuing up properly constitutes catastrophic/egregious/large misalignment, and it’s plausible to me that humans are egregiously misaligned by default, tho I’m not sure [5] ) or as [the AI wanting to behave egregiously badly in mundane circumstances]. I think that there being these two really different interpretations of the same term has caused a bunch of confused thinking by people in alignment.
- ↩︎
this could be framed as asking the AI to develop a good successor; the initial setup might have some processes tasked with “solving alignment”; there might be multiple AIs involved doing different things, eg there can be monitors
- ↩︎
absent fundamental breakthroughs in alignment
- ↩︎
in practice, the only way to regulate this is by banning AGI or by having some AI(s) effectively take over the world and then self-regulate
- ↩︎
My guess is also that things will naively be looking worse once we get to AIs that are actually able to do research autonomously, because these AIs will be less based on human imitation, they will be actually able to come up with new thinky-stuff (new words/concepts/ideas/methods etc), they will not have nice chains of thought, and they will be more trained on clearly inhuman things like doing math/coding/science/tech.
- ↩︎
it maybe also depends on what self-improvement affordances are made available to a human
Who is working on this sort of thing?
Here’s a bunch of stuff off the top of my head, in no particular order, including people who aren’t thinking much about the issue in full generality, but are addressing aspects: [1]
economics has the subfields of social choice theory and mechanism/incentive/institution design. public economics is also relevant. internalizing externalities
there’s a bunch of econ stuff on people coordinating in/as a firm
there is a lot of political philosophy/theory/science on what sorts of political institutions we ought to have. eg see here for a bunch of pointers to contemporary thinking on sortition, or see communist proposals for how we should coordinate, or see anarcho-capitalist proposals
various groups are trying to get money out of politics, eg trying to get Citizens United v. FEC overturned. there are various anti-corruption groups and pro-transparency groups
there have been various attempts to establish a world government
the legal system is one of the main instruments society has for acting on its values and determining facts in specific cases. there’s a lot of work on what it should be like
i think a bunch of sociologists are studying polarization and social media echo chamber stuff
there’s a bunch of work on how to inform people / how to get people to pay attention / how to get people to believe something. eg advertising research, work on how to run propaganda campaigns, theory of journalism
there’s a bunch of work on how to make people able to understand stuff: education theory, designing curricula, teaching
there are various forecasting and (specifically) prediction market initiatives, eg metaculus and manifold
people who created and run twitter community notes, fact-checking in general
people running wikipedia
people running the alignment forum and lesswrong
work on reputation systems
metascience and in particular replication crisis stuff. people trying to improve academic publishing, peer review, academic credit assignment
the field of social epistemology. also just epistemology
there’s a bunch of work on the social and bioevolutionary development of cooperation and trust and trustworthiness. there’s psychology research and self-help stuff on developing into a trustworthy person
there’s a lot of work on how to reduce crime
probably many other directions in sociology and social theory
So, there’s a huge amount of work broadly on coordination. Maybe there should be a more systematic body of understanding here. Maybe there should be an academic field. My personal term for this is “weltgeistbehandlung”. Copying a note I wrote on this for myself:
“In a broad sense, “weltgeistbehandlung” just means improving the world. In a stricter sense, it’s about improving the more living parts over the inert parts (like, improving the academic credit assignment system, not making buildings more beautiful), the more procedural/meta parts over the more object-level parts (like, reducing dysfunction in democratic systems over reducing animal suffering). Even more strictly, it is about improving the more think-y parts of the world: about making the world more truth-tracking, about making the world generate new ideas faster when a need arises, about making decision-making more guided by the best thinking, about making it so our values are worked out more fully, about making it so our values are better heard when decisions are made.
related but clearly non-synonymous: social epistemology, metascience, incentive design, institutional design. i think tikkun olam is somewhat similar. LATER EDIT: Daniel Schmachtenberger’s The Consilience Project seems very similar
characterizing weltgeistbehandlung:
it is somewhat less a science and more an engineering discipline. it’s like medicine / medical science, but we’re healing the world-spirit
a central theme is setting up incentives, setting up hyperparameters, pushing the world toward goodness, with the heavy lifting being done by blind local mess incentives (even by stuff like greed and status-seeking), as opposed to being done by some pure correct judgments of goodness operating locally. like, if we’re setting up incentives with goodness in mind, ultimately the good stuff that happens is (to the extent that we’re successful) caused by a judgment of goodness, but this is happening indirectly. it’s about nudging a mad weltgeist subtly so it propels itself toward goodness. it’s about making goodness rewarded, comfortable, easy. it’s about making good processes/institutions/agents/etc outcompete others. it’s about preserving and expanding the niche/purpose of each good thing. it’s about making goodness win.
i think it should set out to be looking mostly for pareto improvements. despite being sort of about organizing our polis, it could still be kinda apolitical. that said, sometimes some groups just have to lose (eg people who explicitly want to make AIs even if they cause human extinction, eg paid lobbyists or companies effectively buying policies)
important components of weltgeistbehandlung:
coming up with general components for schemes. like patent auctions, prediction markets, accountability mechanisms
constructing particular incentive-fixing/goodness-promoting proposals
analyzing decisions between options (like, which voting scheme should we have?)
implementing these proposals (like what a doctor does)
identifying issues: like, noticing that there is a lot of lying in US business and politics, noticing that one isn’t sufficiently incentivized to provide some certain public good, noticing that academia is goodharting in various ways, etc”
- ↩︎
I’ll be taking a somewhat broad view on what counts as a “coordination failure”, as you seem to be doing.
It indeed seems deeply unnatural for a very smart AI to look at the human world from the outside, be able to replace it with whatever, and be like: “no, i’m not going to use these atoms and this negentropy/energy for anything else — this human world that is here by default is the best thing that could be here; in fact, I will make sure it has a lot of resources to flourish in the future”. It seems [deeply unnatural]/[extremely sharp] for anyone to have values like this. I think it’s unlikely that even humanity-after-developing-correctly-for-a-million-years would think like this if it encountered another Earth with a current-humanity-level alternate humanity on it. [1]
One approach to tackling this difficulty is to try to somehow make an AI that does this imo deeply unnatural thing anyway. But there is also the following alternative approach: to try to make it so there is not anyone that is judging the human world from the outside like this — i.e., that it’s just the human world judging itself. The judgment “we are cool, we have lots of cool projects going on, and we definitely should avoid killing ourselves” is very natural; in particular, it is much more natural than the judgment the AI looking at the human world from the outside needs to make. I think this alternative path requires banning AGI.
One more alternative approach (that overlaps with the previous one): one can also hope to have humans flourish for a long time without any judgment that humans are very cool directly controlling local decision-making. Instead, we can try to set up local incentives so that goodness/humanness is promoted. This way, humans might be able to flourish even in a “hot mess” world. For this, it is crucial that humans and human institutions remain useful. So, this also requires banning AGI.
- ↩︎
Indeed, human civilizations have historically not treated less developed civilizations with much kindness.
- ↩︎
It seems plausible that what you suggest is one significant contributor. Here’s one more thing that imo plausibly contributes significantly:
Most of these people are consequentialists, i.e. they think of ethics in terms of sth like designing a good spacetime block. [1] Like, when making a decision, you are making a decision as if standing outside the universe and choosing which of two spacetime blocks [2] is better. Given this view of ethics, it is very natural to imagine a future in which there actually is some guy that designs/chooses a good spacetime block, and it becomes somewhat less natural to imagine futures in which the spacetime block keeps getting “designed/chosen” in a messy way by all the messy stuff inside the spacetime block, with the designing/choosing and the being-valuable done by the same entities. A person who thinks in terms of duties or a person who thinks in terms of virtues would find it much less natural to have such a strong separation between the locus of moral-agent-hood and the locus of moral-patient-hood.
some additional recent AI x-risk things by Bernie Sanders:
it (correctly) claims to be so
it’s a bit complicated but there is a sense in which the following is true:
Taiwan does not currently claim to be a separate country from mainland China. There are currently two governing systems claiming to be the legitimate government of the entirety of China: one is in Taiwan, and the other in the mainland.
I haven’t thought a lot about this but my guess is that this approach basically can’t work because chaos is a thing so you need to determine parameters on the fly so you need to put some controllers inside
edit: oh i guess maybe you’re suggesting controlling cell division directly very precisely with optical stimulation at precise points inside the strawberry somehow? hmm. i guess you also need to control cell death very precisely
edit 2: oh also you have a major chicken and egg problem with the ovules and the surrounding structure in the parent plant right?
Aren’t we extremely confused about how one would go about making two strawberries which are identical down to the cellular level? Like, the simplest path might go through nanotech or some other pretty crazy thing? (Being able to do that probably implies it wouldn’t be much harder to mass-manufacture humans who are identical down to the cellular level?) I feel like you’re saying you basically know how to reduce it to a bunch of grad student gruntwork (or at least think someone else could) and that sounds really wild to me!
I think this doesn’t make sense capabilities-wise for solving genuinely hard scientific/technological/mathematical/philosophical problems such as the strawberry problem. (It makes sense when the big task has a basically known decomposition into a large number of small easy tasks, though.) A central issue is that good high-level decisions are very important, there are very many of them, and they need to be made with deep understanding of the domain/[design space], which the human in this setup doesn’t have by default for hard problems and which can only be gained by spending a lot of time understanding novel stuff. Like, it would be extremely silly to have a setup in which a human with no university-level math education is suggesting high-level Riemann hypothesis proof strategies to Terence Tao. That human could not be contributing basically anything positive to Tao’s ability to solve the problem.

Maybe the following is a key observation here (you might have considered it already, but I’m including it just in case):
The example we should have in mind is NOT having a research problem that takes an individual human 1000 years to solve, which has some clever decomposition into 5 problems, each of which takes 200 years to solve, with the human needing to provide the 5-problem decomposition and the AI solving the 200-year problems. This is NOT what we should have in mind because if the AI can solve 200-year problems, then we are already very close to making an AI that can autonomously solve 1000-year problems (in fact, in practice, for these particular numbers I expect we would be there basically instantly). Instead, for the question to be interesting, we should imagine the AI being able to solve much shorter problems, like idk 1-week problems. In that case, even if there is some reasonable decomposition into eventually 1-week tasks, it will be a really big complicated object.
One also gets a bound on the capabilities-usefulness of this scheme from the consideration that if decomposition work is easy enough for a task, then one should just be able to have an AI with a small time horizon do it as well, at least if we trust time-horizon thinking. And so either decomposition work is easy and you could replace the human in this scheme with that AI (or better yet, just have the AI solving the subproblems also make decomposition decisions), or decomposition work is hard and it takes a long time for this AI-human system to do the task. Quantitatively, this is saying that if a task can be done by a human-AI system but not an AI-only system, then it should take at least the AI’s time horizon in wall clock time. I guess this conclusion would be softened if decomposition work is outlier-hard among things with the same human time horizon, which seems plausible.
That said, one can of course get some speedup as a human researcher from asking AIs to sometimes do small tasks, I’m just doubting that this can give a huge speedup for solving hard scientific/technological/mathematical/philosophical problems without the AI being basically able to solve them autonomously.
My experience is extremely different from yours. I think almost all the non-[rat/EA] people in my life whose positions on this I know consider it plausible that an AI substantially smarter than any human will be created this century. [1] Thinking of the set of non-[rat/EA] friends/[close-ish acquaintances] I haven’t discussed this topic with yet, my guess is that more than half of them already think this and almost all of them would think this after a 2-hour conversation with me. It’s probably important that this sample skews very high IQ (maybe importantly both quant and verbal) [2] and high openness. [3]
[1] this includes e.g. the 4 family members I’ve discussed this topic with

[2] like, these are mostly people I know from the international olympiad circuit, math and physics majors from my MIT undergrad, and classmates from the best high school in Estonia

[3] Some of them deferring to me partly on the question is probably also doing some work tbh, but I think this isn’t a big enough effect to change the broad strokes conditional on getting them to consider the hypothesis at all.
Imo your plea, as currently written, moderately anti-contributes to truth-seeking norms. There’s a missing mood imo. Like, I would have phrased it as eg:
First, if it is all hype, then it is good that you are saying it is all hype compared to not saying anything, even if various people ignore you. That said, you should treat your time/thought/writing as a valuable resource: it makes sense to assess what true things you are best positioned to help others understand and are most important for others to understand, and to focus on helping others understand those. And I think you might be wrong about this being such a truth, for the following reasons: …
(This could have been a preamble, or a post-preamble for attentional reasons.)
I think the version I suggest is significantly less norm-eroding, though it should still be viewed with some suspicion if one says this sort of thing selectively in response to positions one disagrees with.
One could claim that many of these people are themselves already corrupt consequentialists in their talking, i.e. already not in some sort of truth-seeking community, and it is just fine to “advise these criminals on how to do crime better” if that has good local consequences? I don’t think that’s fine — I think that would be a mistaken assessment of these people, and I think one should be trying more to bring even the people about whom this assessment is correct into the fold.
Since my Claim 1 is about the conceptual work input being 100x sped up, not some final output being 100x sped up, I’ll take you to be disagreeing with Claim 2. So the question is: is 10 years of thinking about AI algorithms followed by 1 month of retraining sufficient [to get from AI that causes 50% of people to be permanently unemployable to crazy smart AI]? In other words, if one is only going to be able to pick low-hanging and medium-hanging fruit in 10 years, is picking those sufficient to get to crazy smart AI from that point? I claim that the answer is yes; some quick points:

I think we should imagine the fruits at the beginning of this to not have been well-picked (supposing a crazy smart AI does not already exist).
Trusting Byrnes’s decomposition of the 7-year 600x nanochat cost improvement, that’s 6x from hardware and 100x from non-hardware. That would give some sort of baseline guess of 100^(10/7) ≈ 720x for 10 years. Ok, but maybe we should apply some adjustments to the factors. In particular, what about data? On the one hand, it will be tough to collect a lot of data from humans quickly in our scenario. On the other hand, it will be very easy to collect [a lot more data [than we have from humans]] from AIs in this scenario, and by that point this will probably be overall better. On the first hand again, maybe we should imagine data not mattering so much at that point. On the second hand again, all things considered that’s actually conceptually correlated with fooming far past human level quickly. We should also apply some global adjustment down for having less time for experiments to run.

Byrnes explicitly does not include algo ideas that “are not about doing the same thing more efficiently, but rather about trying to do something different instead”. See Section 1.5 of his post. But these clearly should be included in our context here, and are majorly important imo. E.g., curating curricula, creating problems for oneself in a different way, coming up with good ways to reward problem-creation, creating more nested levels of problem-solving with their own rewards, coming up with other ways to make rewards denser / track progress better, creating tools for oneself, various IDA ideas (beyond those already mentioned), etc. There are also various ways humans get smarter over centuries and over a lifetime that should also count for our purposes as “algo progress” if the AIs can carry them out, e.g. inventing+acquiring new concepts, questions, methods, and skills, and just knowing more.
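As a quick arithmetic check on that baseline (this just extrapolates the 100x-over-7-years non-hardware factor at a constant exponential rate, which is itself an assumption):

```python
# Extrapolate a 100x non-hardware improvement over 7 years to a 10-year window,
# assuming a constant exponential rate of progress.
rate = 100 ** (1 / 7)        # per-year multiplicative factor, ~1.93x/year
factor_10y = rate ** 10      # equals 100 ** (10 / 7), ~720x
print(f"{rate:.2f}x/year -> ~{factor_10y:.0f}x over 10 years")
```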
In our scenario, coming up with an arbitrarily different new AI design is also legitimate, as long as this AI can be created/trained/grown in at most 1 month.
Tbh a lot of my belief that you get a lot of progress just comes from it being an extremely high-dimensional design space and there surely being lots of things one can do so much better in there.
This is very much unobvious to me, but now that you say this, I realize that I anchored too hard on a specific scenario where the world has gone very hard on just automating away all the economic tasks/roles that can be automated away with advanced robotics and LLMs+++, while humans largely coordinate this fleet in the cases that it wouldn’t handle.
But generally, like, to grant the assumption, suppose that 60% are not employable and 40% are employable. Why is this 40% employable? (I think I also took this to be a somewhat stable situation, for some time, not a mean value theorem sort of thing.) Presumably, because there are things that AI still doesn’t do well. Maybe it’s “just” because robotics is annoyingly hard, but it sounds more plausible to me that (also) AI still is not human-thinking-complete, which makes me somewhat sceptical about this massive conceptual algorithm progress speedup.
this makes me want to ask: are you tracking the difference between the event “50% of current human jobs are basically automated” and the event “50% of humans are such that it basically does not make sense to employ them”? like, the former has probably happened multiple times in history, whereas the latter is unprecedented. what you’re saying makes more sense to me if you have the former in mind, but we’re talking about the latter (“people being permanently unemployable”). i have significant probability that you are tracking this correctly already but wanted to check just in case

> (I think I also took this to be a somewhat stable situation, for some time, not a mean value theorem sort of thing.)
(to make sure we’re on the same page: in my view, this is unlikely to be a somewhat stable situation)
I agree with your point in the canonical solomonoff sequence prediction case. I think your point is what I mean in my note by “you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss)”. I think this pathology is maybe not present in “function solomonoff” (I state this in the note as well but don’t really explain it), though I’m very much uncertain.
to state the hopeful “function solomonoff” story in more detail:
By “function solomonoff”, I mean that we have a data set of string pairs $(x_i, y_i)_{i=1}^n$, and we think of one hypothesis as being a program that takes in an $x$ and outputs a probability distribution on strings from which $y$ is sampled. Let’s say that we are in the classification case, so always $y_i \in \{0,1\}$ (we’re distinguishing pictures which show dogs vs ones which don’t, say).
The “canonical loss” (from which one derives a posterior distribution via exponentiation) here would be the length of the program specifying the distribution, plus the negative log likelihood it assigns to $y_i$ given $x_i$, summed over all $i$. What I’m suggesting is this loss but with a higher coefficient on the length of the program than on the likelihood terms.
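In symbols, a sketch of the suggested loss (notation mine: $\ell(p)$ for the bit-length of program $p$; $\lambda = 1$ recovers the canonical loss, and the suggestion is $\lambda > 1$):

```latex
L_\lambda(p) \;=\; \lambda\,\ell(p) \;+\; \sum_{i=1}^{n} -\log_2 p(y_i \mid x_i),
```

with the posterior obtained by exponentiating: $\Pr(p \mid \text{data}) \propto 2^{-L_\lambda(p)}$.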
Suppose that the classification boundary is most simply given by what we will consider a “simple model” of complexity $K$ bits, together with “systematic human error” which changes the answer from the simple model on a fraction $\epsilon$ of the inputs, with the set of those inputs taking $E$ bits to specify.
If we turned this into sequence prediction by interleaving like $x_1, y_1, x_2, y_2, \ldots$, then I’d agree that if we penalize hypothesis length more steeply than likelihood, then: over getting a model which does not predict the errors, we would get a universal-like hypothesis, which in particular starts to predict the human errors after being conditioned on sufficiently many bits. So the idea of more steep penalization of hypothesis length doesn’t do what we want in the sequence prediction case. But I have some hope that the function case doesn’t have this pathology?
Some models of the given data in the function case:
the “good model”: The distribution is given by the simple model with a probability $\epsilon$ of flipping the answer on top (independently on each input). This gets complexity loss like $K$ plus something small for specifying the flip model, and its expected neg log likelihood is $n H(\epsilon)$ bits (where $H$ is the binary entropy).
the “model that learns the errors”: This should generically take about $K + E$ bits to specify, and it gets $\approx 0$ expected neg log likelihood.
the “50/50 random distribution” model: This takes $O(1)$ bits to specify and has $1$ bit of expected neg log likelihood per data point, so $n$ bits total.
some “universal hypothesis model”: I’m not actually sure what this would even be in the function setting? If you handled the likelihood part by giving a global string of random bits which gets conditioned on other input-output pairs, then I agree we could write something bad just like in the sequence prediction case. But if each input gets its own private randomness, then I don’t see how to write down a universal hypothesis that gets good loss here.
So at least given these models, it looks like the “good model” could be a vertex of the convex hull of the set of attainable (hypothesis complexity, expected neg log likelihood) tuples? If it’s on the convex hull, it’s picked out by some loss of the form described (even in the limit of many data points, though we will need to increase the hypothesis term coefficient compared to the sum of log likelihoods term as the data set size increases, ie in the bayesian picture we will need to pick a stronger prior when the data set is larger in this example).
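As a toy numerical check of the picture above (all numbers hypothetical; the three models compared are: simple model plus iid flip noise, a model that also memorizes the systematic errors, and the trivial 50/50 model):

```python
import math

# Toy numbers (hypothetical): n labeled examples; a simple model of K bits;
# systematic errors on a fraction eps of inputs, whose index set costs E bits.
n, K, E, eps = 10_000, 1_000, 2_000, 0.05

def H(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# (description length in bits, expected negative log likelihood in bits)
models = {
    "good":          (K,     n * H(eps)),   # simple model + prob-eps label flip
    "learns_errors": (K + E, 0.0),          # also memorizes the error set
    "fifty_fifty":   (1,     float(n)),     # ignores x, outputs 1/2 each time
}

def best(lam):
    """Model minimizing lam * description_length + NLL (lam = 1: canonical loss)."""
    return min(models, key=lambda m: lam * models[m][0] + models[m][1])

for lam in (1, 2, 10):
    print(lam, best(lam))
```

With these numbers, the canonical loss (coefficient 1) prefers the error-memorizing model, a coefficient of 2 picks out the good model, and a too-large coefficient of 10 collapses to the trivial 50/50 model, matching the point that the coefficient has to grow with the data set size, but not too fast.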
that said:
Maybe I’m just failing to construct the right “universal hypothesis” for this example?
It seems plausible that some other pathology is present that prevents nice behavior.
I haven’t spent that much time trying to come up with other pathological constructions or searching for a proof that something like the good model is optimal for some hyperparameter setting.
I can see some other examples where this functional setup still doesn’t work nicely. I might write more about that in a later comment. The example here is definitely somewhat cherry-picked for the idea to work, though I also don’t consider it completely contrived.
I think it’s very unlikely this steeper penalization is anywhere close to a full solution to the philosophical problem here. I only have some hope that it works in some specific toy cases.