Cleo Nardo
Does AI-automated AI R&D count as “Recursive Self-Improvement”? I’m not sure what Yudkowsky would say, but regardless, enough people would count it that I’m happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
(This was sitting in my drafts, but I’ll just comment it here bc it makes a very similar point.)
There are two forms of “Recursive Self-Improvement” that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using workflows similar to those humans currently use. For example: studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about this distinction, because they carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition’s observation and modification of the mesocognition when both occur within the same entity (i.e. the same activations or chain-of-thought).
Introspective RSI (left) vs Extrospective RSI (right)
Whatever governance prevents the overlords from appearing could also be used to prevent humans from wasting resources in space. For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
Why do you think that requiring that distant colonies are populated with humans would prevent wasting resources in space?
My guess is that, on a mature population ethics, the best uses of resources (on purely welfarist values, ignoring non-welfarist values, which I do think are important) will look either like a smaller population of minds much “larger” than humans (i.e. galactic utility monsters) or like a large population of minds much “smaller” than humans (i.e. shrimps on heroin).
It would be a coincidence if the optimal allocation of resources involved minds which were exactly the same “size” as humans.
Note that this would be a coincidence on any of the currently popular theories of population ethics (e.g. average, total, variable-value).
People might not instruct the AI to make the future extremely good, where “good” means actually good.
(Of course, if you think P(AI takeover) is 90%, that would probably be a crux.)
I think that (from a risk-neutral total utilitarian perspective) the argument still goes through with 90% P(AI takeover). But the difference is that when you condition on no AI takeover, the worlds look weirder (e.g. great power conflict, scaling breaks down, coup has already happened, early brain uploads, aliens), which means:
(1) the worlds are more diverse, so the impact of any intervention has greater variance and is less likely to be net positive (even if it’s just as positive in expectation)
(2) your impact is lower because the weird transition event is likely to wash out your intervention
Why would demand for AI inference be below 167 tokens/second/American? I expect it to be much higher, and for energy to be a constraint.
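As a rough back-of-envelope (my own sketch, not from the original post; the per-token energy figure is a loudly hypothetical assumption, since real costs vary by orders of magnitude with model and hardware):

```python
# Back-of-envelope: what 167 tokens/second/American implies in aggregate.
# All inputs are rough illustrative assumptions, not measured figures.

US_POPULATION = 335e6              # people, approximately
TOKENS_PER_SEC_PER_PERSON = 167    # the figure under discussion
JOULES_PER_TOKEN = 1.0             # hypothetical energy cost per generated token

aggregate_tokens_per_sec = US_POPULATION * TOKENS_PER_SEC_PER_PERSON
power_gw = aggregate_tokens_per_sec * JOULES_PER_TOKEN / 1e9

print(f"Aggregate demand: {aggregate_tokens_per_sec:.1e} tokens/s")  # ~5.6e10 tokens/s
print(f"Implied power draw: {power_gw:.0f} GW")                      # ~56 GW
# For scale, average US electricity generation is roughly 500 GW, so under
# these assumptions this level of demand alone would be a nontrivial fraction of it.
```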
Strategy-Stealing Argument Against AI Dealmaking
I expect dealmaking is unnecessary for extracting safe and useful labour from that minimal viable AI.
It’s difficult to make credible deals with dumb AIs because they won’t be smart enough to tell whether we have actually ‘signed the contracts’ or not. Maybe we’re simulating a world where we have signed the contracts. So the deals only work when the AIs are so smart that we can’t simulate the environment while deluding them about the existence of contracts. This occurs only when the AI is very smart or widely deployed. But in that case, my guess is they have a decisive strategic advantage (DSA).
I don’t think dealmaking will buy us much safety. This is because I expect that:
In worlds where AIs lack the intelligence & affordances for decisive strategic advantage, our alignment techniques and control protocols should suffice for extracting safe and useful work.
In worlds where AIs have a DSA: if they are aligned, then deals are unnecessary; if they are misaligned, then they would disempower us rather than accept the deal.
That said, I have been thinking about dealmaking because:
It’s neglected, relative to other mechanisms for extracting safe and useful work from AIs, e.g. scalable alignment, mech interp, control.
There might be time-sensitive opportunities to establish credibility with AIs. This seems less likely for other mechanisms.
There are many questions where verification is no easier than generation, e.g. “Is this chess move best?” is no easier than “What’s the best chess move?” Both are EXPTIME-complete (for generalized n×n chess).
Philosophy might have a similar complexity to “What’s the best chess move?”, i.e. “What argument X is such that for all counterarguments X1 there exists a counter-counterargument X2 such that for all counter-counter-counterarguments X3...”, i.e. you explore the game tree of philosophical discourse.
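One rough way to write down that alternation (just a sketch in logical notation; “Defended” is a placeholder predicate, not anyone’s official formalism, standing for the original argument surviving the whole exchange):

$$\exists X \,\forall X_1 \,\exists X_2 \,\forall X_3 \cdots \ \mathrm{Defended}(X, X_1, X_2, X_3, \ldots)$$

The alternating ∃/∀ structure is the same shape as asking whether a chess position is winning, which is where the EXPTIME-style comparison comes from.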
As for whether experiments serve to distinguish science from philosophy: TW has a lecture arguing against this, and he addresses it in a bunch of papers. I’ll summarise his arguments later if I have time.
To clarify, I listed some of Williamson’s claims, but I haven’t summarised any of his arguments.
His actual arguments tend to be ‘negative’, i.e. they go through many distinctions that metaphilosophical exceptionalists purport, and for each he argues that either (i) the purported distinction is insubstantial,[1] or (ii) the distinction mischaracterises philosophy or science or both.[2]
He hasn’t, I think, addressed Wei Dai’s exceptionalism, which is (I gather) something like “Solomonoff induction provides a half-way decent formalism of ideal maths/science, but there isn’t a similarly decent formalism of ideal philosophy.”
I’ll think a bit more about what Williamson might say about Wei Dai’s purported distinction. I think Williamson is open to the possibility that philosophy is qualitatively different from science, so it’s possible he would change his mind if he engaged with Dai’s position.
Also, I’m imagining that everyone stays on Earth and has millions of copies in space (via molecular cloning + uploads). And then it seems like people might agree to keep the Earth-copies as traditional humans, and this agreement would only affect a billionth of the Joseph-copies.
Yep, this seems like a plausible bargaining solution. But I might be wrong. If it turns out that mundane values don’t mind being neighbours with immortal robots then you wouldn’t need to leave Earth.
How Exceptional is Philosophy?
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he’s right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
How similar are the capabilities required? If philosophy requires fundamentally different methods than science and technology, we might automate one without the other.
What are the incentives? I think the direct economic incentives for automating science and technology are stronger than those for automating philosophy. That said, there might be indirect incentives to automate philosophy if philosophical progress becomes a bottleneck to scientific or technological progress.
I’ll consider only the first factor here: How similar are the capabilities required?
Wei Dai is a metaphilosophical exceptionalist. He writes:
We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don’t look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy.
I will contrast Wei Dai’s position with that of Timothy Williamson, a metaphilosophical anti-exceptionalist.
These are the claims that constitute Williamson’s view:
Philosophy is a science.
It’s not a natural science (like particle physics, organic chemistry, nephrology), but not all sciences are natural sciences — for instance, mathematics and computer science are formal sciences. Philosophy is likewise a non-natural science.
Although philosophy differs from other scientific inquiries, it differs no more in kind or degree than they differ from each other. Put provocatively, theoretical physics might be closer to analytic philosophy than to experimental physics.
Philosophy, like other sciences, pursues knowledge. Just as mathematics pursues mathematical knowledge, and nephrology pursues nephrological knowledge, philosophy pursues philosophical knowledge.
Different sciences will vary in their subject-matter, methods, practices, etc., but philosophy doesn’t differ to a far greater degree or in a fundamentally different way.
Philosophical methods (i.e. the ways in which philosophy achieves its aim, knowledge) aren’t starkly different from the methods of other sciences.
Philosophy isn’t a science in a parasitic sense: it doesn’t count as a science merely because it uses scientific evidence or because it has applications for the sciences. Rather, it’s simply another science, not uniquely special. Williamson says, “philosophy is neither queen nor handmaid of the sciences, just one more science with a distinctive character, just as other sciences have distinctive character.”
Philosophy is not, exceptionally among sciences, concerned with words or concepts. This conflicts with many 20th-century philosophers who conceived of philosophy as chiefly concerned with linguistic or conceptual analysis, such as Wittgenstein and Carnap.
Philosophy doesn’t consist of a series of disconnected visionaries. Rather, it consists in the incremental contribution of thousands of researchers: some great, some mediocre, much like any other scientific inquiry.
Roughly speaking, metaphilosophical exceptionalism should make one more pessimistic about philosophical progress keeping pace with scientific and technological progress. I lean towards Williamson’s position, which makes me less pessimistic about philosophy keeping pace by default.
That said, during a rapid takeoff, even small differences in the pace could lead to a growing gap between philosophical progress and scientific/technological progress. So I consider automating philosophy an important problem to work on.
[1] See AI doing philosophy = AI generating hands? (Jan 2024), Meta Questions about Metaphilosophy (Sep 2023), Morality is Scary (Dec 2021), Problems in AI Alignment that philosophers could potentially contribute to (Aug 2019), On the purposes of decision theory research (Jul 2019), Some Thoughts on Metaphilosophy (Feb 2019), The Argument from Philosophical Difficulty (Feb 2019), Two Neglected Problems in Human-AI Safety (Dec 2018), Metaphilosophical Mysteries (2010).
Different example, I think.
In our ttx (tabletop exercise), the AI was spec-aligned (human future flourishing, etc.), but didn’t trust that the lab leadership (Trump) was spec-aligned.
I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.
Taxonomy of deal-making arrangements
When we consider arrangements between AIs and humans, we can analyze them along three dimensions:
Performance obligations define who owes what to whom. These range from unilateral arrangements where only the AI must perform (e.g. providing safe and useful services), through bilateral exchanges where both parties have obligations (e.g. AI provides services and humans provide compensation), to unilateral human obligations (e.g. humans compensate AI without receiving specified services).
Formation conditions govern how the arrangement comes into being. Some obligations might exist by default without any explicit agreement, similar to how we expect other humans to not harm us without signing contracts. Others require active consent from one party (either AI or human can create the obligation) or mutual agreement from both parties.
Termination conditions govern how the arrangement can end. Some arrangements might be permanent, others allow unilateral exit by either party, and still others require mutual consent to dissolve.
These dimensions yield 36 distinct configurations[1], many of which map onto familiar arrangements between humans:
Employment contracts: Bilateral performance (AI works, human provides compensation), where formation requires the consent of both humans and AIs, and termination requires consent of either party.
Slavery: Unilateral AI performance, where formation and termination require the consent of humans only.
Service agreements: Unilateral AI performance, where formation and termination require the consent of both humans and AIs.
Indentured servitude: Bilateral performance (AI works, human provides compensation), where formation requires mutual consent, but termination requires the consent of humans only.
Paid conscription: Bilateral performance (AI serves, human compensates), where formation and termination require the consent of humans only.
Gifts: Unilateral human performance, where formation and termination require the consent of humans only.
Typically when I talk about ‘deals’ I am referring to any arrangement with bilateral performance. This includes paid conscription, indentured servitude, and employment. It excludes slavery (where AIs have obligations but humans do not) and gifts (where humans have obligations but AIs do not). (A rough code sketch of this taxonomy appears below.)
[1] The possible performance obligations are: (1) only AIs have obligations, (2) only humans have obligations, (3) both humans and AIs have obligations. The possible formation conditions are: (1) AIs can unilaterally form the arrangement, (2) humans can unilaterally form the arrangement, (3) either humans or AIs can unilaterally form the arrangement, (4) both humans and AIs must mutually agree to form the arrangement. The possible termination conditions are similar to the possible formation conditions. This gives 4×3×3=36 configurations.
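For concreteness, here is a minimal code sketch of the taxonomy (my own encoding for illustration; the class and enum names are made up, not the author’s), with the six example arrangements above as instances:

```python
# A small encoding of the three dimensions and the example arrangements.
from dataclasses import dataclass
from enum import Enum

class Performance(Enum):
    AI_ONLY = "only the AI must perform"
    HUMAN_ONLY = "only humans must perform"
    BILATERAL = "both parties have obligations"

class Consent(Enum):
    # Who has the power to form (or terminate) the arrangement.
    AI_ONLY = "AIs unilaterally"
    HUMANS_ONLY = "humans unilaterally"
    EITHER = "either party unilaterally"
    MUTUAL = "both parties must agree"

@dataclass(frozen=True)
class Arrangement:
    name: str
    performance: Performance
    formation: Consent
    termination: Consent

EXAMPLES = [
    Arrangement("employment contract", Performance.BILATERAL, Consent.MUTUAL, Consent.EITHER),
    Arrangement("slavery", Performance.AI_ONLY, Consent.HUMANS_ONLY, Consent.HUMANS_ONLY),
    Arrangement("service agreement", Performance.AI_ONLY, Consent.MUTUAL, Consent.MUTUAL),
    Arrangement("indentured servitude", Performance.BILATERAL, Consent.MUTUAL, Consent.HUMANS_ONLY),
    Arrangement("paid conscription", Performance.BILATERAL, Consent.HUMANS_ONLY, Consent.HUMANS_ONLY),
    Arrangement("gift", Performance.HUMAN_ONLY, Consent.HUMANS_ONLY, Consent.HUMANS_ONLY),
]

# "Deals" in the sense used above: any arrangement with bilateral performance.
deals = [a.name for a in EXAMPLES if a.performance is Performance.BILATERAL]
print(deals)  # ['employment contract', 'indentured servitude', 'paid conscription']
```

The filter at the end picks out exactly the arrangements called ‘deals’ above, i.e. those with bilateral performance.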
Remember Bing Sydney?
I don’t have anything insightful to say here. But it’s surprising how little people mention Bing Sydney.
If you ask people for examples of misaligned behaviour from AIs, they might mention:
Sycophancy from 4o
Goodharting unit tests from o3
Alignment-faking from Opus 3
Blackmail from Opus 4
But like, three years ago, Bing Sydney. The most powerful chatbot was connected to the internet and — unexpectedly, without provocation, apparently contrary to its training objective and prompting — threatening to murder people!
Are we memory-holing Bing Sydney, or are there good reasons for not mentioning this more?
Here are some extracts from Bing Chat is blatantly, aggressively misaligned (Evan Hubinger, 15th Feb 2023).