I think I need more practice talking with people in real time (about intellectual topics). (I’ve gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
I also personally do straightforwardly think that most of the efforts of the extended EA-Alignment ecosystem are bad
Do you have a diagnosis of the root cause of this?
I have definitely taken actions within the bounds of what seems reasonable that have aimed at getting the EA community to shut down or disappear (and will probably continue to do so).
Why not try to reform EA instead? (This is related to my previous question. If we could diagnose what’s causing EA to be harmful, maybe we can fix it?)
It is not viable to beat the Gerards of the world via fighting them on every little edit.
Is this still true, in light of (current or future) LLMs and AI in general? My guess is that the particular pathology exemplified by David Gerard becomes largely irrelevant.
I also wish you would go into more detail on the “Humans don’t benefit even if you ‘solve alignment’” part.
My own answer to this is that humans aren't secure, and AI will exacerbate the problem greatly by helping the offense (i.e., exploiting human vulnerabilities) a lot more than the defense. I've focused on philosophy (thinking that offense merely requires training AI to be persuasive, while defense seems to require solving metaphilosophy, i.e., understanding what correct philosophical reasoning consists of), but more recently realized that the attack surface is much bigger than that. For example, humans can fall in love (resulting in massive changes to one's utility function, if you were to model a human as a utility maximizer). It will be straightforward to use RL to train AI to cause humans to fall in love with them (or their characters), but how do you train AI to help defend against that? Would most humans even want to defend against that, or care enough about it?
So even with alignment, the default outcome seems to be a civilization with massively warped or corrupted human values. My inner Carl wants to reply that aligned and honest AI advisors will warn us of this and help us fix it before it's too late, maybe by convincing policymakers to pass regulations to prevent such "misuse" of AI? And then my reply to that would be that I don't see how such regulations can work; policymakers won't care enough; it seems easier to train AI to attack humans than to make such predictions in a way that is both honest/unbiased and persuasive to policymakers; and AI might not have the necessary long-horizon causal understanding to craft good regulations before it's too late.
Another possibility is that alignment tax is just too high, so competitive pressures erode alignment even if it’s “solved” in some sense.
See this comment which links to a bunch of previous discussions.
In that case, don’t dump waste heat into black holes so large that it’s impossible to use them as eventual energy sources. Instead dump waste heat into medium sized black holes, which can feasibly be used as eventual energy sources.
If governments subsidize embryo selection, we should get a general uplift of everyone’s IQ (or everyone who decides to participate) so the resulting social dynamics shouldn’t be too different from today’s. Repeat that for a few generations, then build AGI (or debate/decide what else to do next). That’s the best scenario I can think of (aside from the “we luck out” ones).
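(A back-of-the-envelope Monte Carlo sketch of the per-generation arithmetic, with illustrative parameter values I'm assuming rather than taking from anywhere: selecting the best of n embryos by a polygenic predictor gains roughly E[max of n normal draws] times the predictor's standard deviation among siblings, per generation.)

```python
# Back-of-the-envelope sketch (hypothetical parameters) of embryo selection:
# keep the embryo with the highest polygenic-score prediction.
import random
import statistics

N_EMBRYOS = 10       # embryos available per couple -- an assumption
PREDICTOR_SD = 4.0   # SD, in IQ points, of the predictor among siblings
                     # (illustrative value only, not a measured figure)
TRIALS = 100_000

def gain_one_family() -> float:
    # Model each embryo's predicted-IQ deviation as N(0, PREDICTOR_SD)
    # and select the maximum.
    return max(random.gauss(0, PREDICTOR_SD) for _ in range(N_EMBRYOS))

avg = statistics.mean(gain_one_family() for _ in range(TRIALS))
print(f"expected gain: ~{avg:.1f} IQ points per generation")
# With these numbers: E[max of 10 draws] ≈ 1.54 SD, i.e. ~6 points, which
# then compounds across the "repeat that for a few generations" step.
```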
From Bennett et al.'s reply to the aestivation paper:
Thus we come to our first conclusion: a civilization can freely erase bits without forgoing larger future rewards up until the point when all accessible bounded resources are jointly thermalized.
They don’t mention black holes specifically, but my interpretation of this is that a civilization can first dump waste heat into a large black hole, and then later when the CMB temperature drops below that of the black hole, reverse course to use Hawking radiation of the black hole as energy source and CMB as heat sink.
If we only consider thermodynamics (and ignore how technologically feasible this is), I think this should net you the same amount of total computation over time, but allow you to do a lot of it earlier.
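(To spell out the thermodynamics, using standard formulas rather than anything from the Bennett et al. reply itself: Hawking temperature falls with mass, Landauer's principle ties the cost of erasing a bit to the sink temperature, and the reversal becomes possible once the CMB cools below the hole.)

```latex
% Hawking temperature: falls as 1/M, so a large hole is an extremely cold sink.
T_H = \frac{\hbar c^3}{8 \pi G M k_B}
    \approx 6 \times 10^{-8}\,\mathrm{K} \times \frac{M_\odot}{M}

% Landauer's principle: minimum energy to erase one bit into a sink at
% temperature T, so a colder sink allows more erasures per joule.
E_{\min} = k_B T \ln 2

% "Reverse course" condition: the hole becomes an energy source (with the
% CMB as the new heat sink) once the expanding universe cools below it.
T_{\mathrm{CMB}}(t) < T_H
```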
Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:
Please read this outline and suggest some ways to improve the script, especially to make it more philosophically and intellectually interesting.
And as part of its answer it suggested:
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Then I wrote
What about exploring their previous lack of critical thinking and implications of that both for their own universe and the larger universe they’re part of?
Which made it focus more on the fiction hypothesis.
The part about the story being written by AI was its own idea. No hints from me as far as I can remember.
ETA: Here’s the whole answer to my “make it more philosophically and intellectually interesting” prompt:
This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:
1. Expand on the nature of reality:
Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Explore the implications of living in a simulated reality and how it affects free will and personal identity.
2. Deepen the existential crisis:
Have characters grapple with the meaning of their existence if their reality is artificial.
Explore how this realization affects their relationships, duties, and sense of purpose.
Include scenes where characters contemplate whether their memories and experiences are “real.”
3. Introduce a moral dilemma:
If they discover their reality is simulated, should they attempt to contact their “creators”?
Explore the ethics of continuing to exist in a simulated world vs. attempting to “wake up” to a potentially harsher reality.
Have characters debate the morality of creating simulated universes themselves.
4. Expand on the scientific implications:
Have Data and Geordi theorize about the computational power required to simulate their universe.
Explore how the laws of physics in their universe might be “programmed” and if they can be altered.
Discuss the anthropic principle and fine-tuning of universal constants.
5. Incorporate elements of epistemology:
Have characters question how they can know anything for certain given their potentially simulated existence.
Explore concepts like Descartes’ “I think, therefore I am” in the context of artificial consciousness.
Discuss the nature of consciousness and whether it can be truly simulated.
6. Add a temporal element:
Introduce time dilation or time loops as the crew’s enhanced cognition allows them to perceive time differently.
Explore how their perception of time affects their understanding of causality and free will.
7. Enhance the ending:
Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
Consider a more ambiguous ending that leaves the question of their reality’s nature unresolved.
8. Include references to philosophical works:
Have characters discuss relevant philosophical concepts, such as Plato’s Allegory of the Cave, Berkeley’s idealism, or Putnam’s brain in a vat thought experiment.
Include a scene where the crew debates different philosophical perspectives on the nature of reality.
9. Explore the ethical implications of their technology:
Expand on Riker’s question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
Delve deeper into the ethics of using replicators and how it affects societal structures and values.
10. Incorporate a meta-narrative element:
Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
Explore the idea of narrative causality and how it might shape their reality.
These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.
Thanks! I was unsure whether to post this, because I’m worried about contributing to a future where AI generated content overwhelms everything, and I wasn’t sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)
Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I’ll just describe what I did in more detail. Started with
Can you write an outline for an episode of STTNG in which the crew becomes suspicious of various logical inconsistencies in the universe (including backups being taboo) and starts questioning the nature of their reality?
which resulted in this initial outline:
I. Teaser
The Enterprise responds to a distress call from a Federation outpost
Upon arrival, they find the outpost abandoned, with only a mysterious device left behind
II. Act One
Data activates the device, which emits a pulse of energy
Crew members begin noticing small inconsistencies in their daily routines
Geordi questions why they don’t use replicators to solve resource shortages on colonies
III. Act Two
More crew members report strange observations
Dr. Crusher wonders why they don’t use transporter technology to cure diseases or injuries
Worf ponders the improbability of so many humanoid species evolving independently
Picard becomes concerned about the crew’s unusual behavior
IV. Act Three
Riker raises the question of why they don’t use transporters for backing up crew members
The senior staff meets to discuss the mounting inconsistencies
Data theorizes that the mysterious device has enhanced their critical thinking abilities
Picard orders the ship to return to the outpost for further investigation
V. Act Four
At the outpost, they discover records of experiments in “cognitive enhancement”
The crew starts questioning the nature of their reality
Some crew members experience existential crises
The ship’s computer begins to malfunction as the crew’s doubts intensify
VI. Act Five
Data determines that their enhanced cognition is causing a feedback loop in the ship’s systems
The crew must find a way to deactivate the device and return to normal
Philosophical debate ensues about whether they should retain their newfound awareness
Picard makes the difficult decision to deactivate the device
VII. Epilogue
The crew returns to normal, but with a lingering sense of unease
Picard records a captain’s log reflecting on the experience and the nature of their reality
Final scene shows a subtle glitch in the ship’s operations, leaving a sense of ambiguity
Then the main things I recall doing:
Ask AI to add parts about audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
Add second part of conversation between Guinan and Ensign, with both of them having second thoughts about what Guinan initially said.
Add part about suspending ship computer to sandbox, to make clear it wasn’t harmed.
Ask AI for additional realizations/inferences the crew might have, and add selected ones to the story.
Ask AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
Point out a bunch of sentences/paragraphs that didn't make as much sense as they could, with AI generally able to improve them a lot in one try.
Also try to get AI to find issues with the story on its own, though that didn't work very well.
Manually make some edits, like changing a word here and there, deleting parts that were repetitive, and moving parts around to make the story flow better.
If it doesn’t work, whoever designed this universe should be fired for ruining such an elegant scheme. :)
Yes, advanced civilizations should convert stellar matter 100% into energy using something like the Hawking radiation of small black holes, then dump waste heat into large black holes.
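(The standard evaporation formulas, which I'm supplying to ground the small-vs-large split; they're not in the comment itself:)

```latex
% Hawking radiation power and evaporation lifetime scale sharply with mass:
P = \frac{\hbar c^6}{15360\,\pi G^2 M^2}, \qquad
t_{\mathrm{evap}} \approx \frac{5120\,\pi G^2 M^3}{\hbar c^4}

% Small M: high power and short lifetime -- effectively a matter-to-radiation
% converter with ~100% mass-energy yield (E = Mc^2), the "convert stellar
% matter into energy" half.
% Large M: negligible power, near-zero temperature, enormous lifetime -- a
% cold reservoir for the "dump waste heat" half.
```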
I’m sharing a story about the crew of Enterprise from Star Trek TNG[1].
This was written with AI assistance, and my workflow was to give the general theme to AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and do small manual changes. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.
[1] getting an intelligence/rationality upgrade, which causes them to deduce their fictional nature.
Recall the design questions for roles, laws, and norms I outlined earlier:
What social and institutional roles do we want AI systems to play in our personal and collective lives?
Given those roles, what norms, objectives, regulations, or laws should guide and regulate the scope and behavior of AI systems?
I think we lack the intellectual tools (i.e., sufficiently advanced social sciences) to do this. You gave Confucian contractualism as a source of positive intuitions, but I view it as more of a negative example. When the industrial revolution happened, China was unable to successfully design new social and institutional roles to face the challenge of European powers, and after many decades of conflict/debate ended up adopting the current Communist form of government, which is very suboptimal and caused massive human suffering.
You could argue that today’s social sciences are more advanced, but then so is the challenge we face (increased speed of change, AIs being outside human distribution of values and capabilities thereby making past empirical evidence and intuitions much less useful, etc.).
One nice thing about the alignment approach you argue against (analyzing AIs as EU maximizers) is that it can potentially be grounded in well-understood mathematics, which can then be leveraged to analyze multi-agent systems. Although that’s harder than it seems, there is at least the potential for intellectual progress built upon a solid foundation.
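(For concreteness, the "well-understood mathematics" referred to here is, at minimum, expected-utility maximization; this gloss is mine, not part of the comment:)

```latex
% An EU maximizer with beliefs P over states and utility U over outcomes picks
a^* = \operatorname*{arg\,max}_{a \in A} \; \sum_{s \in S} P(s)\, U\big(o(a, s)\big)

% The von Neumann--Morgenstern theorem derives this form from axioms on
% preferences, and game theory builds multi-agent analysis (equilibria,
% bargaining) on top of it -- the "solid foundation" the comment points at.
```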
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”.
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”. For example imagine a variant of Counterfactual Mugging in which there is no preexisting agent, and instead Omega creates an agent from scratch after flipping the coin and gives it the decision problem. UDT is fine with this but “follow whatever commitments a past version of yourself would have made if they’d thought about your situation” wouldn’t work.
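(A toy sketch, mine rather than the comment's, of how UDT evaluates whole policies from the prior; I'm assuming the standard Counterfactual Mugging payoffs of a $100 demand on tails against a $10,000 counterfactual reward on heads:)

```python
# Toy sketch (mine, not from the comment) of UDT-style policy selection in
# Counterfactual Mugging, assuming the standard payoffs: on tails Omega asks
# you for $100; on heads Omega pays you $10,000 iff your policy pays on tails.

P_HEADS = 0.5

def expected_utility(pays_on_tails: bool) -> float:
    """Score a whole policy from the prior, before the coin is flipped."""
    reward_if_heads = 10_000 if pays_on_tails else 0
    cost_if_tails = -100 if pays_on_tails else 0
    return P_HEADS * reward_if_heads + (1 - P_HEADS) * cost_if_tails

# No "past self" is consulted anywhere: the evaluation works the same even
# if Omega creates the agent from scratch after the flip, which is why the
# "commitments of a past version of yourself" framing comes apart from UDT.
for policy in (True, False):
    print(f"pays_on_tails={policy}: EU = {expected_utility(policy):+.0f}")
# pays_on_tails=True:  EU = +4950  (the policy UDT selects)
# pays_on_tails=False: EU = +0
```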
I recall that I described "exceptionless decision theory" or XDT as "do what my creator would want me to do", which seems closer to your idea. I don't think I followed up on the idea beyond this, maybe because I realized that humans aren't running any formal decision theory, so "what my creator would want me to do" is ill-defined. (Although one could say my interest in metaphilosophy is related to this, since what I would want an AI to do is to solve normative decision theory using correct philosophical reasoning, and then do what it recommends.)
Anyway, the upshot is that I think you’re exploring a decision theory approach that’s pretty distinct from UDT so it’s probably a good idea to call it something else. (However there may be something similar in the academic literature, or someone described something similar on LW that I’m not familiar with or forgot.)
But I think I relate to the world with some kind of “existential positive” all the same—and I’ve tried to explain how doing so can be compatible with looking “bad things are bad” in the eye, and without reducing spirituality to population ethics. In particular, I’ve tried to point at the possible role of stuff like mother love, loyalty, innocence, tragedy, and forgiveness.
On further thought, this reminds me a bit of (seemingly successful) efforts in Chinese history to transfer people's feelings of spirituality and filial love to the sovereign (君), by likening the sovereign to a god and/or parent, or directly claiming such status. Of course the aim or motivation here is much more benign, but that example shows that people's feelings can be wrong "out of distribution", and we should perhaps be suspicious or skeptical of trying to apply our emotions outside of their original domains.
There’s two other ways for things to go wrong though:
AI capabilities research switches attention from LLM (back) to RL. (There was a lot of debate in the early days of IDA about whether it would be competitive with RL, and part of that was about whether all the important tasks we want a highly capable AI to do could be broken down easily enough and well enough.)
The task decomposition part starts working well enough, but Eliezer’s (and others’) concern about “preserving alignment while amplifying capabilities” proves valid.
It might also be a crux for alignment, since scalable alignment schemes like IDA and Debate rely on “task decomposition”, which seems closely related to “planning” and “reasoning”. I’ve been wondering about the slow pace of progress of IDA and Debate. Maybe it’s part of the same phenomenon as the underwhelming results of AutoGPT and BabyAGI?
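(A toy illustration of the task-decomposition pattern that IDA/HCH-style schemes rely on; the sketch and its stand-in task are mine, not anyone's actual implementation:)

```python
# Toy sketch of amplification via task decomposition, in the spirit of
# IDA/HCH: a weak "overseer" that can only combine two sub-answers gets
# amplified by recursively delegating subquestions to copies of the
# amplified system. Summing a list stands in for a real task.

def weak_overseer(a: int, b: int) -> int:
    """Stand-in for a limited agent: can only combine two sub-answers."""
    return a + b

def amplified_sum(xs: list[int]) -> int:
    """Answer 'sum(xs)' by recursive decomposition into halves."""
    if len(xs) == 1:                      # base case the weak agent handles
        return xs[0]
    mid = len(xs) // 2
    left = amplified_sum(xs[:mid])        # delegate subquestion 1
    right = amplified_sum(xs[mid:])       # delegate subquestion 2
    return weak_overseer(left, right)     # weak agent composes sub-answers

print(amplified_sum(list(range(1, 101))))  # 5050
```

The crux the comment points at is that real tasks, unlike summation, may not decompose this cleanly, and that recomposition may not preserve alignment even when it preserves capability.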
But I think I relate to the world with some kind of “existential positive” all the same—and I’ve tried to explain how doing so can be compatible with looking “bad things are bad” in the eye, and without reducing spirituality to population ethics. In particular, I’ve tried to point at the possible role of stuff like mother love, loyalty, innocence, tragedy, and forgiveness.
I think I either don't understand these bolded concepts (in the way that you use them here), or I do understand but they don't resonate with me. (E.g., reading Chesterton's quoted passage doesn't seem to make me feel any loyalty or patriotism, or positivity, towards the universe.) In any case, they don't seem to play much of a role in whether I relate to the world with "existential positive" or "existential negative". (I personally tend toward the neutral and negative sides but with a lot of uncertainty, based on uncertainty/guesses of what my values are or should be, and how human(-descended) civilization, and civilizations in general, seem likely to turn out.)
Would you say that relating to the world with some kind of “existential positive” (and these bolded concepts playing a role in that) is just a descriptive statement about your own psychology, or is it actually a normative statement that should also apply for other people, like me? (Your quoted statement above is literally just descriptive, but I wonder if you meant to give it a normative connotation, or would be willing to defend a normative version.)
For example, re: the virtues of liberalism/niceness/boundaries
I left a comment there, which you haven't responded to yet. I also wonder whether your positive intuitions towards liberalism/niceness/boundaries are based on history/experience with humans, which may well not apply to AGI due to potentially very different social/economic dynamics. See AGI will drastically increase economies of scale for one example of how some related intuitions could be invalidated.
**Me:** "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."
Please give some citations so I can check your memory/interpretation? One source I found is the post where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive "bootstrapping" part. For example, my own comment started with:
I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.
When Eliezer weighed in on IDA in 2018, he also didn’t object to the assumption of an aligned weak AGI and instead focused his skepticism on “preserving alignment while amplifying capabilities”.
I think this is a very natural frame, but here is an alternative. Humans largely do not have large-scale or long-term goals/values relative to which they can be said to be making mistakes that AI will help prevent in the future. Instead they care almost exclusively about local and short-term concerns like satisfying physical desires and winning status games, and the fate of our universe depends largely on humanity's side effects as they go about acting on these parochial interests. (Consider e.g. Warren Buffett's interest in accumulating wealth and disinterest in how it's actually spent, suggesting that he only cares about it as a status marker.)
This state of affairs may well persist into the AGI era, with status games taking on even more importance as physical desires are largely satisfied/satiated. (This assumes we avoid a bunch of even worse failure modes.)
Since status games are competitive games that necessarily have winners and losers, in this scenario it doesn't seem to make sense to say that humans will stop making mistakes with AI assistance.