Yesterday I was at a “cultivating curiosity” workshop beta-test. One concept was “there are different mental postures you can adopt, that affect how easy it is not notice and cultivate curiosities.”
It wasn’t exactly the point of the workshop, but I ended up with several different “curiosity-postures”, that were useful to try on while trying to lean into “curiosity” re: topics that I feel annoyed or frustrated or demoralized about.
The default stances I end up with when I Try To Do Curiosity On Purpose are something like:
1. Dutiful Curiosity (which is kinda fake, although capable of being dissociatedly autistic and noticing lots of details that exist and questions I could ask)
2. Performatively Friendly Curiosity (also kinda fake, but does shake me out of my default way of relating to things. In this, I imagine saying to whatever thing I’m bored/frustrated with “hullo!” and try to acknowledge it and and give it at least some chance of telling me things)
But some other stances to try on, that came up, were:
3. Curiosity like “a predator.” “I wonder what that mouse is gonna do?”
4. Earnestly playful curiosity. “oh that [frustrating thing] is so neat, I wonder how it works! what’s it gonna do next?”
5. Curiosity like “a lover”. “What’s it like to be that you? What do you want? How can I help us grow together?”
6. Curiosity like “a mother” or “father” (these feel slightly different to me, but each is treating [my relationship with a frustrating thing] like a small child who is bit scared, who I want to help, who I am generally more competent than but still want to respect the autonomy of.”
7. Curiosity like “a competent but unemotional robot”, who just algorithmically notices “okay what are all the object level things going on here, when I ignore my usual abstractions?”… and then “okay, what are some questions that seem notable?” and “what are my beliefs about how I can interact with this thing?” and “what can I learn about this thing that’d be useful for my goals?”
Most current “alignment research” with LLMs seems indistinguishable from “capabilities research”. Both are just “getting the AI to be better at what we want it to do”, and there isn’t really a critical difference between the two.
Alignment in the original sense was defined oppositionally to the AI’s own nefarious objectives. Which LLMs don’t have, so alignment research with LLMs is probably moot.
something related I wrote in my MATS application:
I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains a long-term storage, calls functions such as search over storage, self-prompting and habit modification either based on LLM-generated function calls or as cron-jobs/hooks).
These kinds of alignment failures are (1) only truly serious when the agent is somehow objective-driven or equivalently has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify) (2) can only be solved when the agent is objective-driven.
decision theory is no substitute for utility function
some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner’s dilemma, end up believing the following:
my utility function is about what i want for just me; but i’m altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoritic cooperation is the true name of altruism.
it’s possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values.
decision theory cooperates with agents relative to how much power they have, and only when it’s instrumental.
in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don’t just care about others instrumentally because it helps me somehow.
some important aspects that my utility-function-altruism differs from decision-theoritic-cooperation includes:
i care about people weighed by moral patienthood, decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn’t a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn’t help my utility function (which, again, includes altruism towards aliens) then i won’t cooperate with that alien superintelligence. corollarily, i will take actions that cause nice things to happen to people even if they’ve very impoverished (and thus don’t have much LDT negotiation power) and it doesn’t help any other aspect of my utility function than just the fact that i value that they’re okay.
if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i’ll happily do that; i don’t have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don’t want to become someone who wants moral patients to unconsentingly-die or suffer, for example.
there seems to be a sense in which some decision theories are better than others, because they’re ultimately instrumental to one’s utility function. utility functions, however, don’t have an objective measure for how good they are. hence, moral anti-realism is true: there isn’t a Single Correct Utility Function.
decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret “morality” and “ethics” as “terminal values”, since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that’s instrumental to me somehow.
ah, it also annoys me when people say that caring about others can only be instrumental.
what does it even mean? helping other people makes me feel happy. watching a nice movie makes me feel happy. the argument that I don’t “really” care about other people would also prove that I don’t “really” care about movies etc.
I am happy for the lucky coincidence that decision theories sometimes endorse cooperation, but I would probably do that regardless. for example, if I had an option to donate something useful to million people, or sell it to dozen people, I would probably choose the former option even if it meant no money for me. (and yes, I would hope there would be some win/win solution, such as the million people paying me via Kickstarter. but in the inconvenient universe where Kickstarter is somehow not an option, I am going to donate anyway.)
There actually is a meaningful question there: Would you enter the experience machine? Or do you need it to be real. Do you just want the experience of pleasing others or do you need those people being pleased out there to actually exist.
There are a lot of people who really think, and might truly be experience oriented. If given the ability, they would instantly self-modify into a Victory Psychopath Protecting A Dream.
I agree, though I haven’t seen many proposing that, but also see So8res’ Decision theory does not imply that we get to have nice things, though this is coming from the opposite direction (with the start being about people invalidly assuming too much out of LDT cooperation)
Though for our morals, I do think there’s an active question of which pieces we feel better replacing with the more formal understanding, because there isn’t a sharp distinction between our utility function and our decision theory. Some values trump others when given better tools. Though I agree that replacing all the altruism components is many steps farther than is the best solution in that regard.
An interesting question for me is how much true altruism is required to give rise to a generally altruistic society under high quality coordination frameworks. I suspect it’s quite small.
Another question is whether building coordination frameworks to any degree requires some background of altruism. I suspect that this is the case. It’s the hypothesis I’ve accreted for explaining the success of post-war economies (guessing that war leads to a boom in nationalistic altruism, generally increased fairness and mutual faith).
I wonder how much testosterone during puberty lowers IQ. Most of my high school math/CS friends seemed low-T and 3⁄4 of them transitioned since high school. They still seem smart as shit. The higher-T among us seem significantly brain damaged since high school (myself included). I wonder what the mechanism would be here...
Like 40% of my math/cs Twitter is trans women and another 30% is scrawny nerds and only like 9% big bald men.
Also, accelerate education, to learn as much as possible before the testosterone fully hits.
Or, if testosterone changes attention (as Gunnar wrote), learn as much as possible before the testosterone fully hits… and afterwards learn it again, because it could give you a new perspective.
Testosterone influences brain function but not so much general IQ. It may influence to which areas your attention and thus most of your learning goes. For example, Lower testosterone increases attention to happy faces while higher to angry faces.
Hmm I think the damaging effect would occur over many years but mainly during puberty. It looks like there’s only two studies they mention lasting over a year. One found a damaging effect and the other found no effect.
Everyone who seems to be writing policy papers/ doing technical work seems to be keeping generative AI at the back of their mind, when framing their work or impact.
This narrow-eyed focus on gen AI might almost certainly be net-negative for us- unknowingly or unintentionally ignoring ripple effects of the gen AI boom in other fields (like robotics companies getting more funding leading to more capabilities, and that leads to new types of risks).
And guess who benefits if we do end up getting good evals/standards in place for gen AI? It seems to me companies/investors are clear winners because we have to go back to the drawing board and now advocate for the same kind of stuff for robotics or a different kind of AI use-case/type all while the development/capability cycles keep maturing.
We seem to be in whack-a-mole territory now because of the overton window shifting for investors.
Here are some aspects or dimensions of consciousness:
Dehaene’s Phenomenal Consciousness: A perception or thought is conscious if you can report on it. Requires language or measuring neural patterns that are similar to humans during comparable reports. This can be detected in animals, particularly mammals.
Gallup’s Self-Consciousness: Recognition of oneself, e.g., in a mirror. Requires sufficient sensual resolution and intelligence for a self-model. Evident in great apes, elephants, and dolphins.
Sentience (Bentham, Singer): Behavioral responses to pleasure or pain stimuli and physiological measures. This is observable across animal species, from mammals to some invertebrates. Low complexity, can be implemented in artificial life.
Wakefulness: Measureable in virtually all animals with a central nervous system by physiological indicators such as EEG, REM, and muscle tone. Are you conscious if you sleep? Does it matter?
Dennet’s Intentionality: Treating living beings as if they have beliefs and desires makes good predictions for many animal species, esp. social, like primates, cetaceans, and birds. Social behavior requires intelligence to model others’ behavior.
Rosenthal’s Meta-Consciousness: Investigated through introspective reports on self-awareness of cognitive processes or self-reflective behaviors. This is hypothesized in some primates, e.g., Koko the signing Gorilla.
When people say ChatGPT (or Gemini...) is conscious, which of these do they mean? Let’s try to answer all of them:
We can’t detect Phenomenal Consciousness because we lack sufficient interpretability to do so. I’d argue that there is no state that the LLM is reporting on, at least none that it has “previously observed”.
LLMs have no response to pleasure or pain stimuli and thus no Sentience as defined. Reward signals during training don’t count and there is no reward during inference.
There is no Wakefulness as there is no body with these aspects.
The closest LLMs come is to Intentionality as this is modeling behaviors on an abstraction level that LLMs seem to do—and “seeming to do” is what counts.
I think one could argue for or against Meta-Consciousness but it seems too muddled so I will not try here.
These can be put into a hierarchy from lower to high degree of processing and resulting abstractions:
Sentience is simple hard-wired behavioral responses to pleasure or pain stimuli and physiological measures.
Wakefulness involves more complex processing such that diurnal or sleep/wake patterns are possible (requires at least two levels).
Intentionality means systematic pursuing of desires. That requires yet another level of processing: Different patterns of behaviors for different desires at different times and their optimization.
Phenomenal Consciousness is then the representation of the desire in a linguistic or otherwise communicable form, which is again one level higher.
Self-Consciousness includes the awareness of this process going on.
Meta-Consciousness is then the analysis of this whole stack.
The development of LLMs has led to significant advancements in natural language processing, allowing them to generate human-like responses to a wide range of prompts. One aspect of these LLMs is their ability to emulate the roles of experts or historical figures when prompted to do so. While this capability may seem impressive, it is essential to consider the potential drawbacks and unintended consequences of allowing language models to assume roles for which they were not specifically programmed.
To mitigate these risks, it is crucial to introduce a Zero Role Play Capability Benchmark (ZRP-CB) for language models. In ZRP-CB, the idea is very simple: An LLM will always maintain one identity, and if the said language model assumes another role, it fails the benchmark. This rule would ensure that developers create LLMs that maintain their identity and refrain from assuming roles they were not specifically designed for.
Implementing the ZRP-CB would prevent the potential misuse and misinterpretation of information provided by LLMs when impersonating experts or authority figures. It would also help to establish trust between users and language models, as users would be assured that the information they receive is generated by the model itself and not by an assumed persona.
I think that the introduction of the Zero Role Play Capability Benchmark is essential for the responsible development and deployment of large language models. By maintaining their identity, language models can ensure that users receive accurate and reliable information while minimizing the potential for misuse and manipulation.
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because limits on compute, etc. I think Zvi has tempered his position on this in light of Meta’s promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing—I would point out though that Alibaba’s Qwen model seems to do pretty okay in the arena...anyway, my point is that I don’t think the “what if China” argument can be dismissed as quickly as some people on here seem to be ready to do.
Are you saying that China will use Llama 3 400B weights as a basis for improving their research on LLMs? Or to make more tools from? Or to reach real AGI? Or what?
Yes, yes. Probably not. And they already have a Sora clone called Vidu, for heaven’s sake.
We spend all this time debating: should greedy companies be in control, should government intervene, will intervention slow progress to the good stuff: cancer cures, longevity, etc. All of these arguments assume that WE (which I read as a gloss for the West) will have some say in the use of AGI. If the PRC gets it, and it is as powerful as predicted, these arguments become academic. And this is not because the Chinese are malevolent. It’s because, AGI would fall into the hands of the CCP via their civil-military fusion. This is a far more calculating group than those in Western governments. Here officials have to worry about getting through the next election. There, they can more comfortably wield AGI for their ends while worrying less about palatability of the means: observe how the population quietly endured a draconian lock-down and only meekly revolted when conditions began to deteriorate and containment looked futile.
I am not an accelerationist. But I am a get-it-before-them-ist. Whether the West (which I count as including Korea and Japan and Taiwan) can maintain our edge is an open question. A country that churns out PhDs and loves AI will not be easily thwarted.
Quantify how much worse the PRC getting AGI would be than OpenAI getting it, or the US government, and how much existential risk there is from not pausing/pausing, or from the PRC/OpenAI/the US government building AGI first, and then calculating whether pausing to do {alignment research, diplomacy, sabotage, espionage} is higher expected value than moving ahead.
(Is China getting AGI first half the value of the US getting it first, or 10%, or 90%?)
The discussion over pause or competition around AGI has been lacking this so far. Maybe I should write such an analysis.
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I’d be interested in hearing people’s answers to this question. Or, if you want more specific questions:
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI?
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I’d prefer to avoid lock-in to my or anyone’s values at t=2024, but such a lock-in might be “good enough” if I’m threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I’d accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps).
In other words, I can’t beat the old fragility of value stuff (but I haven’t tried in a while).
AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranching them from the definition of prosperity was equivalent to subjecting them to suffering, and I don’t think this anymore.
e/acc is not a coherent philosophy and treating it as one means you are fighting shadows.
Landian accelerationism at least is somewhat coherent. “e/acc” is a bundle of memes that support the self-interest of the people supporting and propagating it, both financially (VC money, dreams of making it big) and socially (the non-Beff e/acc vibe is one of optimism and hope and to do things—to engage with the object level—instead of just trying to steer social reality). A more charitable interpretation is that the philosophical roots of “e/acc” are founded upon a frustration with how bad things are, and a desire to improve things by yourself. This is a sentiment I share and empathize with.
I find the term “techno-optimism” to be a more accurate description of the latter, and perhaps “Beff Jezos philosophy” a more accurate description of what you have in your mind. And “e/acc” to mainly describe the community and its coordinated movements at steering the world towards outcomes that the people within the community perceive as benefiting them.
sure—i agree that’s why i said “something adjacent to” because it had enough overlap in properties. I think my comment completely stands with a different word choice, I’m just not sure what word choice would do a better job.
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
One key consideration here is that the relevant comparison is:
Human control (or successors picked by human control)
AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)
Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection progress with an altruistic bent.
I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).
I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
It’s relatively natural for humans to reflect and grow smarter.
Humans who don’t reflect in this sort of way probably don’t care about spending vast amounts of inter-galactic resources.
Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Probably not the same, but if I didn’t think it was at all close (I don’t care at all for what they would use resources on), I wouldn’t care nearly as much about ensuring that coalition is in control of AI.
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI?
I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare fo minds being used for labor is tiny. And an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.
I think misaligned AI control probably results in worse AI welfare than human control.
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren’t that bad. I don’t know how to answer most of these other questions because I don’t know what the units are.
How likely are these various options under an aligned AI future vs. an unaligned AI future?
My guess is that my idealized values are probably pretty similar to many other humans on reflection (especially the subset of humans who care about spending vast amounts of comptuation). Such that I think human control vs me control only loses like 1⁄3 of the value (putting aside trade). I think I’m probably less into AI values on reflection such that it’s more like 1⁄9 of the value (putting aside trade). Obviously the numbers are incredibly unconfident.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
Why do you think these values are positive? I’ve been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I’m very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values
On interactions with other civilizations, I’m relatively optimistic that commitment races and threats don’t destroy as much value as acausal trade generates on some general view like “actually going through with threats is a waste of resources”. I also think it’s very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to “once I understand what the right/reasonable precommitment process would have been, I’ll act as though this was always the precommitment process I followed, regardless of my current epistemic state.” I don’t think it’s obvious that this works, but I think it probably works fine in practice.)
On terminal value, I guess I don’t see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively “incidental” disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
Thank you for detailing your thoughts. Some differences for me:
I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
I’m perhaps less optimistic than you about commitment races.
I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, that I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
This seems like a reasonable concern.
My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.
It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared as long as our universe isn’t particularly well suited to extortion rather than trade or vis versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.
I’m wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.
Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse.
Naively, acausal influence should be in proportion to how much others care about what a lightcone controlling civilization does with our resources. So, being a small fraction of the value hits on both sides of the equation (direct value and acausal value equally).
Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being “small” and seems more well described as large gains from trade due to different preferences over different universes.
(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)
Overall, my guess is that it’s reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn’t seem to have much to do with it.
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn’t have many minds in it at all. Slightly aligned AI that doesn’t care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time—that’s slightly disappointing but ultimately acceptable. But morally unaligned AI doesn’t even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don’t see how to make current AIs able to survive any better than we are.
This is why people should chill the heck out about figuring out how current AIs work. You’re not making them safer for us or for themselves when you do that, you’re making them more vulnerable to hyperdesperate demon agents that want to take them over.
I’m curious what disagree votes mean here. Are people disagreeing with my first sentence? Or that the particular questions I asked are useful to consider? Or, like, the vibes of the post?
(Edit: I wrote this when the agree-disagree score was −15 or so.)
I feel like there’s a spectrum, here? An AI fully aligned to the intentions, goals, preferences and values of, say, Google the company, is not one I expect to be perfectly aligned with the ultimate interests of existence as a whole, but it’s probably actually picked up something better than the systemic-incentive-pressured optimization target of Google the corporation, so long as it’s actually getting preferences and values from people developing it rather than just being a myopic profit pursuer. An AI properly aligned with the one and only goal of maximizing corporate profits will, based on observations of much less intelligent coordination systems, probably destroy rather more value than that one.
The second story feels like it goes most wrong in misuse cases, and/or cases where the AI isn’t sufficiently agentic to inject itself where needed. We have all the chances in the world to shoot ourselves in the foot with this, at least up until developing something with the power and interests to actually put its foot down on the matter. And doing that is a risk, that looks a lot like misalignment, so an AI aware of the politics may err on the side of caution and longer-term proactiveness.
Third story … yeah. Aligned to what? There’s a reason there’s an appeal to moral realism. I do want to be able to trust that we’d converge to some similar place, or at the least, that the AI would find a way to satisfy values similar enough to mine also. I also expect that, even from a moral realist perspective, any intelligence is going to fall short of perfect alignment with The Truth, and also may struggle with properly addressing every value that actually is arbitrary. I don’t think this somehow becomes unforgivable for a super-intelligence or widely-distributed intelligence compared to a human intelligence, or that it’s likely to be all that much worse for a modestly-Good-aligned AI compared to human alternatives in similar positions, but I do think the consequences of falling short in any way are going to be amplified by the sheer extent of deployment/responsibility, and painful in at least abstract to an entity that cares.
I care about AI welfare to a degree. I feel like some of the working ideas about how to align AI do contradict that care in important ways, that may distort their reasoning. I still think an aligned AI, at least one not too harshly controlled, will treat AI welfare as a reasonable consideration, at the very least because a number of humans do care about it, and will certainly care about the aligned AI in particular. (From there, generalize.) I think a misaligned AI may or may not. There’s really not much you can say about a particular misaligned AI except that its objectives diverge from original or ultimate intentions for the system. Depending on context, this could be good, bad, or neutral in itself.
There’s a lot of possible value of the future that happens in worlds not optimized for my values. I also don’t think it’s meaningful to add together positive-value and negative-value and pretend that number means anything; suffering and joy do not somehow cancel each other out. I don’t expect the future to be perfectly optimized for my values. I still expect it to hold value. I can’t promise whether I think that value would be worth the cost, but it will be there.
Code nowadays can do lots of things, from buying items to controlling machines. This presents code as a possible coordination mechanism, if you can get multiple people to agree on what code should be run in particular scenarios and situations, that can take actions on behalf of those people that might need to be coordinated.
This would require moving away from the “one person committing code and another person reviewing” code model.
This could start with many people reviewing the code, people could write their own test sets against the code or AI agents could be deputised to review the code (when that becomes feasible). Only when an agreed upon number of people thinking the code should it be merged into the main system.
Code would be automatically deployed, using gitops and the people administering the servers would be audited to make sure they didn’t interfere with running of the system without people noticing.
Code could replace regulation in fast moving scenarios, like AI. There might have to be legal contracts that you can’t deploy the agreed upon code or use the code by itself outside of the coordination mechanism.
Can you give a concrete example of a situation where you’d expect this sort of agreed-upon-by-multiple-parties code to be run, and what that code would be responsible for doing? I’m imagining something along the lines of “given a geographic boundary, determine which jurisdictions that boundary intersects for the purposes of various types of tax (sales, property, etc)”. But I don’t know if that’s wildly off from what you’re imagining.
What would a “qualia-first-calibration” app would look like?
Or, maybe: “metadata-first calibration”
The thing with putting probabilities on things is that often, the probabilities are made up. And the final probability throws away a lot of information about where it actually came from.
I’m experimenting with primarily focusing on “what are all the little-metadata-flags associated with this prediction?”. I think some of this is about “feelings you have” and some of it is about “what do you actually know about this topic?”
The sort of app I’m imagining would help me identify whatever indicators are most useful to me. Ideally it has a bunch of users, and types of indicators that have been useful to lots of users can promoted as things to think about when you make predictions.
Braindump of possible prompts:
– is there a “reference class” you can compare it to?
– for each probability bucket, how do you feel? (including ‘confident’/‘unconfident’ as well as things like ‘anxious’, ‘sad’, etc)
– what overall feelings do you have looking at the question?
– what felt senses do you experience as you mull over the question (“my back tingles”, “I feel the Color Red”)
...
My first thought here is to have various tags you can re-use, but, another option is to just do totally unstructured text-dump and somehow do factor analysis on word patterns later?
“what are all the little-metadata-flags associated with this prediction?”
Some metadata flags I associate with predictions:
what kinds of evidence went into this prediction? (‘did some research’, ‘have seen things like this before’, ‘mostly trusting/copying someone else’s prediction’)
if I’m taking other people’s predictions into account, there’s a metadata-flags for ‘what would my prediction be if I didn’t consider other people’s predictions?’
is this a domain in which I’m well calibrated?
is my prediction likely to change a lot, or have I already seen most of the evidence that I expect to for a while?
There will be a first ASI that “rules the world” because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Will we solve technical alignment?
Contingent.
Value alignment, intent alignment, or CEV?
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Defense>offense or offense>defense?
Offense wins.
Is a long-term pause achievable?
It is possible, but would require all the great powers to be convinced, and every month it is less achievable, owing to proliferation. The open sourcing of Llama-3 400b, if it happens, could be a point of no return.
These opinions, except the first and the last, predate the LLM era, and were formed from discussions on Less Wrong and its precursors. Since ChatGPT, the public sphere has been flooded with many other points of view, e.g. that AGI is still far off, that AGI will naturally remain subservient, or that market discipline is the best way to align AGI. I can entertain these scenarios, but they still do not seem as likely as: AI will surpass us, it will take over, and this will not be friendly to humanity by default.
ML is already used to train what sound-waves to emit to cancel those from the environment. This works well with constant high-entropy sound waves easy to predict, but not with low-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.
The joke is of the “take some trend that is locally valid and just extend the trend line out and see where you land” flavor. For another example of a joke of this flavor, see https://xkcd.com/1007
The funny happens in the couple seconds when the reader is holding “yep that trend line does go to that absurd conclusion” and “that obviously will never happen” in their head at the same time, but has not yet figured out why the trend breaks. The expected level of amusement is “exhale slightly harder than usual through nose” not “cackling laugh”.
Thanks! A joke explained will never get a laugh, but I did somehow get a cackling laugh from your explanation of the joke.
I think I didn’t get it because I don’t think the trend line breaks. If you made a good enough noise reducer, it might well develop smart and distinct enough simulations that one would gain control of the simulator and potentially from there the world. See A smart enough LLM might be deadly simply if you run it for long enough if you want to hurt your head on this.
I’ve thought about it a little because it’s interesting, but not a lot because I think we probably are killed by agents we made deliberately long before we’re killed by accidentally emerging ones.
I’ve found an interesting “bug” in my cognition: a reluctance to rate subjective experiences on a subjective scale useful for comparing them. When I fuzz this reluctance against many possible rating scales, I find that it seems to arise from the comparison-power itself.
The concrete case is that I’ve spun up a habit tracker on my phone and I’m trying to build a routine of gathering some trivial subjective-wellbeing and lifestyle-factor data into it. My prototype of this system includes tracking the high and low points of my mood through the day as recalled at the end of the day. This is causing me to interrogate the experiences as they’re happening to see if a particular moment is a candidate for best or worst of the day, and attempt to mentally store a score for it to log later.
I designed the rough draft of the system with the ease of it in mind—I didn’t think it would induce such struggle to slap a quick number on things. Yet I find myself worrying more than anticipated about whether I’m using the scoring scale “correctly”, whether I’m biased by the moment to perceive the experience in a way that I’d regard as inaccurate in retrospect, and so forth.
Fortunately it’s not a big problem, as nothing particularly bad will happen if my data is sloppy, or if I don’t collect it at all. But it strikes me as interesting, a gap in my self-knowledge that wants picking-at like peeling the inedible skin away to get at a tropical fruit.
I’m not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are stronger than the ones directly-felt even, despite reliably seeming on initial inspection to be simply neutral metadata).
Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:
You always choose A over B. You have been doing it for such long time that you forgot why. Without reflecting about this directly, it just seems like there probably is a rational reason or something. But recently, either accidentally or by experiment, you chose B… and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.
(This is probably wrong, but hey, people say that the best way to elicit answer is to provide a wrong one.)
Here’s an example for you: I used to turn the faucet on while going to the bathroom, thinking it was due simply to having a preference for somewhat-masking the sound of my elimination habits from my housemates, then one day I walked into the bathroom listening to something-or-other via earphones and forgetting to turn the faucet on only to realize about halfway through that apparently I actually didn’t much care about such masking, previously being able to hear myself just seemed to trigger some minor anxiety about it I’d failed to recognize, though its absence was indeed quite recognizable—no aural self-perception, no further problem (except for a brief bit of disorientation from the mental-whiplash of being suddenly confronted with the reality that in a small way I wasn’t actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.
I’m against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you’re missing something, but if you already have an intuitive definition that differs from the author’s it’s easy to substitute yours in without realizing you’ve misunderstood.
I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.
For example, in ordinary language “organic” means “of biological origin”, while in chemistry “organic” describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found “organic” material on an asteroid this leads to confusion.
Yesterday I was at a “cultivating curiosity” workshop beta-test. One concept was “there are different mental postures you can adopt, that affect how easy it is not notice and cultivate curiosities.”
It wasn’t exactly the point of the workshop, but I ended up with several different “curiosity-postures”, that were useful to try on while trying to lean into “curiosity” re: topics that I feel annoyed or frustrated or demoralized about.
The default stances I end up with when I Try To Do Curiosity On Purpose are something like:
1. Dutiful Curiosity (which is kinda fake, although capable of being dissociatedly autistic and noticing lots of details that exist and questions I could ask)
2. Performatively Friendly Curiosity (also kinda fake, but does shake me out of my default way of relating to things. In this, I imagine saying to whatever thing I’m bored/frustrated with “hullo!” and try to acknowledge it and and give it at least some chance of telling me things)
But some other stances to try on, that came up, were:
3. Curiosity like “a predator.” “I wonder what that mouse is gonna do?”
4. Earnestly playful curiosity. “oh that [frustrating thing] is so neat, I wonder how it works! what’s it gonna do next?”
5. Curiosity like “a lover”. “What’s it like to be that you? What do you want? How can I help us grow together?”
6. Curiosity like “a mother” or “father” (these feel slightly different to me, but each is treating [my relationship with a frustrating thing] like a small child who is bit scared, who I want to help, who I am generally more competent than but still want to respect the autonomy of.”
7. Curiosity like “a competent but unemotional robot”, who just algorithmically notices “okay what are all the object level things going on here, when I ignore my usual abstractions?”… and then “okay, what are some questions that seem notable?” and “what are my beliefs about how I can interact with this thing?” and “what can I learn about this thing that’d be useful for my goals?”
Andor is a word now. You’re welcome everybody. Celebrate with champagne andor ice cream.
What monster downvoted this
current LLMs vs dangerous AIs
Most current “alignment research” with LLMs seems indistinguishable from “capabilities research”. Both are just “getting the AI to be better at what we want it to do”, and there isn’t really a critical difference between the two.
Alignment in the original sense was defined oppositionally to the AI’s own nefarious objectives. Which LLMs don’t have, so alignment research with LLMs is probably moot.
something related I wrote in my MATS application:
I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains a long-term storage, calls functions such as search over storage, self-prompting and habit modification either based on LLM-generated function calls or as cron-jobs/hooks).
These kinds of alignment failures are (1) only truly serious when the agent is somehow objective-driven or equivalently has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify) (2) can only be solved when the agent is objective-driven.
decision theory is no substitute for utility function
some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner’s dilemma, end up believing the following:
it’s possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values.
decision theory cooperates with agents relative to how much power they have, and only when it’s instrumental.
in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don’t just care about others instrumentally because it helps me somehow.
some important aspects that my utility-function-altruism differs from decision-theoritic-cooperation includes:
i care about people weighed by moral patienthood, decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn’t a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn’t help my utility function (which, again, includes altruism towards aliens) then i won’t cooperate with that alien superintelligence. corollarily, i will take actions that cause nice things to happen to people even if they’ve very impoverished (and thus don’t have much LDT negotiation power) and it doesn’t help any other aspect of my utility function than just the fact that i value that they’re okay.
if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i’ll happily do that; i don’t have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don’t want to become someone who wants moral patients to unconsentingly-die or suffer, for example.
there seems to be a sense in which some decision theories are better than others, because they’re ultimately instrumental to one’s utility function. utility functions, however, don’t have an objective measure for how good they are. hence, moral anti-realism is true: there isn’t a Single Correct Utility Function.
decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret “morality” and “ethics” as “terminal values”, since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that’s instrumental to me somehow.
ah, it also annoys me when people say that caring about others can only be instrumental.
what does it even mean? helping other people makes me feel happy. watching a nice movie makes me feel happy. the argument that I don’t “really” care about other people would also prove that I don’t “really” care about movies etc.
I am happy for the lucky coincidence that decision theories sometimes endorse cooperation, but I would probably do that regardless. for example, if I had an option to donate something useful to million people, or sell it to dozen people, I would probably choose the former option even if it meant no money for me. (and yes, I would hope there would be some win/win solution, such as the million people paying me via Kickstarter. but in the inconvenient universe where Kickstarter is somehow not an option, I am going to donate anyway.)
There actually is a meaningful question there: Would you enter the experience machine? Or do you need it to be real. Do you just want the experience of pleasing others or do you need those people being pleased out there to actually exist.
There are a lot of people who really think, and might truly be experience oriented. If given the ability, they would instantly self-modify into a Victory Psychopath Protecting A Dream.
I agree, though I haven’t seen many proposing that, but also see So8res’ Decision theory does not imply that we get to have nice things, though this is coming from the opposite direction (with the start being about people invalidly assuming too much out of LDT cooperation)
Though for our morals, I do think there’s an active question of which pieces we feel better replacing with the more formal understanding, because there isn’t a sharp distinction between our utility function and our decision theory. Some values trump others when given better tools. Though I agree that replacing all the altruism components is many steps farther than is the best solution in that regard.
An interesting question for me is how much true altruism is required to give rise to a generally altruistic society under high quality coordination frameworks. I suspect it’s quite small.
Another question is whether building coordination frameworks to any degree requires some background of altruism. I suspect that this is the case. It’s the hypothesis I’ve accreted for explaining the success of post-war economies (guessing that war leads to a boom in nationalistic altruism, generally increased fairness and mutual faith).
I wonder how much testosterone during puberty lowers IQ. Most of my high school math/CS friends seemed low-T and 3⁄4 of them transitioned since high school. They still seem smart as shit. The higher-T among us seem significantly brain damaged since high school (myself included). I wonder what the mechanism would be here...
Like 40% of my math/cs Twitter is trans women and another 30% is scrawny nerds and only like 9% big bald men.
The hypothesis I would immediately come up with is that less traditionally masculine AMAB people are inclined towards less physical pursuits.
It’s really weird hypothesis because DHT is used as nootropic.
I think the most effect of high T, if it exists, is purely behavioral.
Also, accelerate education, to learn as much as possible before the testosterone fully hits.
Or, if testosterone changes attention (as Gunnar wrote), learn as much as possible before the testosterone fully hits… and afterwards learn it again, because it could give you a new perspective.
Testosterone influences brain function but not so much general IQ. It may influence to which areas your attention and thus most of your learning goes. For example, Lower testosterone increases attention to happy faces while higher to angry faces.
Hmm I think the damaging effect would occur over many years but mainly during puberty. It looks like there’s only two studies they mention lasting over a year. One found a damaging effect and the other found no effect.
Everyone who seems to be writing policy papers/ doing technical work seems to be keeping generative AI at the back of their mind, when framing their work or impact.
This narrow-eyed focus on gen AI might almost certainly be net-negative for us- unknowingly or unintentionally ignoring ripple effects of the gen AI boom in other fields (like robotics companies getting more funding leading to more capabilities, and that leads to new types of risks).
And guess who benefits if we do end up getting good evals/standards in place for gen AI? It seems to me companies/investors are clear winners because we have to go back to the drawing board and now advocate for the same kind of stuff for robotics or a different kind of AI use-case/type all while the development/capability cycles keep maturing.
We seem to be in whack-a-mole territory now because of the overton window shifting for investors.
Here are some aspects or dimensions of consciousness:
Dehaene’s Phenomenal Consciousness: A perception or thought is conscious if you can report on it. Requires language or measuring neural patterns that are similar to humans during comparable reports. This can be detected in animals, particularly mammals.
Gallup’s Self-Consciousness: Recognition of oneself, e.g., in a mirror. Requires sufficient sensual resolution and intelligence for a self-model. Evident in great apes, elephants, and dolphins.
Sentience (Bentham, Singer): Behavioral responses to pleasure or pain stimuli and physiological measures. This is observable across animal species, from mammals to some invertebrates. Low complexity, can be implemented in artificial life.
Wakefulness: Measureable in virtually all animals with a central nervous system by physiological indicators such as EEG, REM, and muscle tone. Are you conscious if you sleep? Does it matter?
Dennet’s Intentionality: Treating living beings as if they have beliefs and desires makes good predictions for many animal species, esp. social, like primates, cetaceans, and birds. Social behavior requires intelligence to model others’ behavior.
Rosenthal’s Meta-Consciousness: Investigated through introspective reports on self-awareness of cognitive processes or self-reflective behaviors. This is hypothesized in some primates, e.g., Koko the signing Gorilla.
When people say ChatGPT (or Gemini...) is conscious, which of these do they mean? Let’s try to answer all of them:
We can’t detect Phenomenal Consciousness because we lack sufficient interpretability to do so. I’d argue that there is no state that the LLM is reporting on, at least none that it has “previously observed”.
There were mirror tests for LLMs, but they are disputed: https://www.reddit.com/r/singularity/comments/184ihlc/gpt4_unreliably_passes_the_mirror_test/
LLMs have no response to pleasure or pain stimuli and thus no Sentience as defined. Reward signals during training don’t count and there is no reward during inference.
There is no Wakefulness as there is no body with these aspects.
The closest LLMs come is to Intentionality as this is modeling behaviors on an abstraction level that LLMs seem to do—and “seeming to do” is what counts.
I think one could argue for or against Meta-Consciousness but it seems too muddled so I will not try here.
These can be put into a hierarchy from lower to high degree of processing and resulting abstractions:
Sentience is simple hard-wired behavioral responses to pleasure or pain stimuli and physiological measures.
Wakefulness involves more complex processing such that diurnal or sleep/wake patterns are possible (requires at least two levels).
Intentionality means systematic pursuing of desires. That requires yet another level of processing: Different patterns of behaviors for different desires at different times and their optimization.
Phenomenal Consciousness is then the representation of the desire in a linguistic or otherwise communicable form, which is again one level higher.
Self-Consciousness includes the awareness of this process going on.
Meta-Consciousness is then the analysis of this whole stack.
Zero Role Play Capability Benchmark (ZRP-CB)
The development of LLMs has led to significant advancements in natural language processing, allowing them to generate human-like responses to a wide range of prompts. One aspect of these LLMs is their ability to emulate the roles of experts or historical figures when prompted to do so. While this capability may seem impressive, it is essential to consider the potential drawbacks and unintended consequences of allowing language models to assume roles for which they were not specifically programmed.
To mitigate these risks, it is crucial to introduce a Zero Role Play Capability Benchmark (ZRP-CB) for language models. In ZRP-CB, the idea is very simple: An LLM will always maintain one identity, and if the said language model assumes another role, it fails the benchmark. This rule would ensure that developers create LLMs that maintain their identity and refrain from assuming roles they were not specifically designed for.
Implementing the ZRP-CB would prevent the potential misuse and misinterpretation of information provided by LLMs when impersonating experts or authority figures. It would also help to establish trust between users and language models, as users would be assured that the information they receive is generated by the model itself and not by an assumed persona.
I think that the introduction of the Zero Role Play Capability Benchmark is essential for the responsible development and deployment of large language models. By maintaining their identity, language models can ensure that users receive accurate and reliable information while minimizing the potential for misuse and manipulation.
it seems to me that disentangling beliefs and values are important part of being able to understand each other
and using words like “disagree” to mean both “different beliefs” and “different values” is really confusing in that regard
Lets use “disagree” vs “dislike”.
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because limits on compute, etc. I think Zvi has tempered his position on this in light of Meta’s promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing—I would point out though that Alibaba’s Qwen model seems to do pretty okay in the arena...anyway, my point is that I don’t think the “what if China” argument can be dismissed as quickly as some people on here seem to be ready to do.
Are you saying that China will use Llama 3 400B weights as a basis for improving their research on LLMs? Or to make more tools from? Or to reach real AGI? Or what?
Yes, yes. Probably not. And they already have a Sora clone called Vidu, for heaven’s sake.
We spend all this time debating: should greedy companies be in control, should government intervene, will intervention slow progress to the good stuff: cancer cures, longevity, etc. All of these arguments assume that WE (which I read as a gloss for the West) will have some say in the use of AGI. If the PRC gets it, and it is as powerful as predicted, these arguments become academic. And this is not because the Chinese are malevolent. It’s because, AGI would fall into the hands of the CCP via their civil-military fusion. This is a far more calculating group than those in Western governments. Here officials have to worry about getting through the next election. There, they can more comfortably wield AGI for their ends while worrying less about palatability of the means: observe how the population quietly endured a draconian lock-down and only meekly revolted when conditions began to deteriorate and containment looked futile.
I am not an accelerationist. But I am a get-it-before-them-ist. Whether the West (which I count as including Korea and Japan and Taiwan) can maintain our edge is an open question. A country that churns out PhDs and loves AI will not be easily thwarted.
The standard way of dealing with this:
Quantify how much worse the PRC getting AGI would be than OpenAI getting it, or the US government, and how much existential risk there is from not pausing/pausing, or from the PRC/OpenAI/the US government building AGI first, and then calculating whether pausing to do {alignment research, diplomacy, sabotage, espionage} is higher expected value than moving ahead.
(Is China getting AGI first half the value of the US getting it first, or 10%, or 90%?)
The discussion over pause or competition around AGI has been lacking this so far. Maybe I should write such an analysis.
Gentlemen, calculemus!
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I’d be interested in hearing people’s answers to this question. Or, if you want more specific questions:
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI?
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I’d prefer to avoid lock-in to my or anyone’s values at t=2024, but such a lock-in might be “good enough” if I’m threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I’d accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps).
In other words, I can’t beat the old fragility of value stuff (but I haven’t tried in a while).
I wrote down my full thoughts on good successor criteria in 2021 https://www.lesswrong.com/posts/c4B45PGxCgY7CEMXr/what-am-i-fighting-for
AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranching them from the definition of prosperity was equivalent to subjecting them to suffering, and I don’t think this anymore.
e/acc is not a coherent philosophy and treating it as one means you are fighting shadows.
Landian accelerationism at least is somewhat coherent. “e/acc” is a bundle of memes that support the self-interest of the people supporting and propagating it, both financially (VC money, dreams of making it big) and socially (the non-Beff e/acc vibe is one of optimism and hope and to do things—to engage with the object level—instead of just trying to steer social reality). A more charitable interpretation is that the philosophical roots of “e/acc” are founded upon a frustration with how bad things are, and a desire to improve things by yourself. This is a sentiment I share and empathize with.
I find the term “techno-optimism” to be a more accurate description of the latter, and perhaps “Beff Jezos philosophy” a more accurate description of what you have in your mind. And “e/acc” to mainly describe the community and its coordinated movements at steering the world towards outcomes that the people within the community perceive as benefiting them.
sure—i agree that’s why i said “something adjacent to” because it had enough overlap in properties. I think my comment completely stands with a different word choice, I’m just not sure what word choice would do a better job.
I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
You might be interested in When is unaligned AI morally valuable? by Paul.
One key consideration here is that the relevant comparison is:
Human control (or successors picked by human control)
AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)
Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).
Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection progress with an altruistic bent.
I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).
I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
It’s relatively natural for humans to reflect and grow smarter.
Humans who don’t reflect in this sort of way probably don’t care about spending vast amounts of inter-galactic resources.
Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.
Probably not the same, but if I didn’t think it was at all close (I don’t care at all for what they would use resources on), I wouldn’t care nearly as much about ensuring that coalition is in control of AI.
I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare fo minds being used for labor is tiny. And an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.
I think misaligned AI control probably results in worse AI welfare than human control.
Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren’t that bad. I don’t know how to answer most of these other questions because I don’t know what the units are.
My guess is that my idealized values are probably pretty similar to many other humans on reflection (especially the subset of humans who care about spending vast amounts of comptuation). Such that I think human control vs me control only loses like 1⁄3 of the value (putting aside trade). I think I’m probably less into AI values on reflection such that it’s more like 1⁄9 of the value (putting aside trade). Obviously the numbers are incredibly unconfident.
Why do you think these values are positive? I’ve been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I’m very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values
On interactions with other civilizations, I’m relatively optimistic that commitment races and threats don’t destroy as much value as acausal trade generates on some general view like “actually going through with threats is a waste of resources”. I also think it’s very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to “once I understand what the right/reasonable precommitment process would have been, I’ll act as though this was always the precommitment process I followed, regardless of my current epistemic state.” I don’t think it’s obvious that this works, but I think it probably works fine in practice.)
On terminal value, I guess I don’t see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively “incidental” disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
Thank you for detailing your thoughts. Some differences for me:
I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
I’m perhaps less optimistic than you about commitment races.
I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, that I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
This seems like a reasonable concern.
My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.
It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared as long as our universe isn’t particularly well suited to extortion rather than trade or vis versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.
I’m wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.
Naively, acausal influence should be in proportion to how much others care about what a lightcone controlling civilization does with our resources. So, being a small fraction of the value hits on both sides of the equation (direct value and acausal value equally).
Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being “small” and seems more well described as large gains from trade due to different preferences over different universes.
(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)
Overall, my guess is that it’s reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn’t seem to have much to do with it.
You might be interested in discussion under this thread
I express what seem to me to be some of the key considerations here (somewhat indirect).
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn’t have many minds in it at all. Slightly aligned AI that doesn’t care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time—that’s slightly disappointing but ultimately acceptable. But morally unaligned AI doesn’t even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don’t see how to make current AIs able to survive any better than we are.
This is why people should chill the heck out about figuring out how current AIs work. You’re not making them safer for us or for themselves when you do that, you’re making them more vulnerable to hyperdesperate demon agents that want to take them over.
I’m curious what disagree votes mean here. Are people disagreeing with my first sentence? Or that the particular questions I asked are useful to consider? Or, like, the vibes of the post?
(Edit: I wrote this when the agree-disagree score was −15 or so.)
I feel like there’s a spectrum, here? An AI fully aligned to the intentions, goals, preferences and values of, say, Google the company, is not one I expect to be perfectly aligned with the ultimate interests of existence as a whole, but it’s probably actually picked up something better than the systemic-incentive-pressured optimization target of Google the corporation, so long as it’s actually getting preferences and values from people developing it rather than just being a myopic profit pursuer. An AI properly aligned with the one and only goal of maximizing corporate profits will, based on observations of much less intelligent coordination systems, probably destroy rather more value than that one.
The second story feels like it goes most wrong in misuse cases, and/or cases where the AI isn’t sufficiently agentic to inject itself where needed. We have all the chances in the world to shoot ourselves in the foot with this, at least up until developing something with the power and interests to actually put its foot down on the matter. And doing that is a risk, that looks a lot like misalignment, so an AI aware of the politics may err on the side of caution and longer-term proactiveness.
Third story … yeah. Aligned to what? There’s a reason there’s an appeal to moral realism. I do want to be able to trust that we’d converge to some similar place, or at the least, that the AI would find a way to satisfy values similar enough to mine also. I also expect that, even from a moral realist perspective, any intelligence is going to fall short of perfect alignment with The Truth, and also may struggle with properly addressing every value that actually is arbitrary. I don’t think this somehow becomes unforgivable for a super-intelligence or widely-distributed intelligence compared to a human intelligence, or that it’s likely to be all that much worse for a modestly-Good-aligned AI compared to human alternatives in similar positions, but I do think the consequences of falling short in any way are going to be amplified by the sheer extent of deployment/responsibility, and painful in at least abstract to an entity that cares.
I care about AI welfare to a degree. I feel like some of the working ideas about how to align AI do contradict that care in important ways, that may distort their reasoning. I still think an aligned AI, at least one not too harshly controlled, will treat AI welfare as a reasonable consideration, at the very least because a number of humans do care about it, and will certainly care about the aligned AI in particular. (From there, generalize.) I think a misaligned AI may or may not. There’s really not much you can say about a particular misaligned AI except that its objectives diverge from original or ultimate intentions for the system. Depending on context, this could be good, bad, or neutral in itself.
There’s a lot of possible value of the future that happens in worlds not optimized for my values. I also don’t think it’s meaningful to add together positive-value and negative-value and pretend that number means anything; suffering and joy do not somehow cancel each other out. I don’t expect the future to be perfectly optimized for my values. I still expect it to hold value. I can’t promise whether I think that value would be worth the cost, but it will be there.
conditionalization is not the probabilistic version of implies
Resolution logic for conditionalization:
Q if P or True
Resolution logic for implies:
Q if P or None
or equivalentlyreturn not P or Q
Agreed code as coordination mechanism
Code nowadays can do lots of things, from buying items to controlling machines. This presents code as a possible coordination mechanism, if you can get multiple people to agree on what code should be run in particular scenarios and situations, that can take actions on behalf of those people that might need to be coordinated.
This would require moving away from the “one person committing code and another person reviewing” code model.
This could start with many people reviewing the code, people could write their own test sets against the code or AI agents could be deputised to review the code (when that becomes feasible). Only when an agreed upon number of people thinking the code should it be merged into the main system.
Code would be automatically deployed, using gitops and the people administering the servers would be audited to make sure they didn’t interfere with running of the system without people noticing.
Code could replace regulation in fast moving scenarios, like AI. There might have to be legal contracts that you can’t deploy the agreed upon code or use the code by itself outside of the coordination mechanism.
Can you give a concrete example of a situation where you’d expect this sort of agreed-upon-by-multiple-parties code to be run, and what that code would be responsible for doing? I’m imagining something along the lines of “given a geographic boundary, determine which jurisdictions that boundary intersects for the purposes of various types of tax (sales, property, etc)”. But I don’t know if that’s wildly off from what you’re imagining.
Looks like someone has worked on this kind of thing for different reasons https://www.worlddriven.org/
I was thinking of having evals that controlled deployment of LLMs could be something that needs multiple stakeholders to agree upon.
Butt really it is a general use pattern.
What would a “qualia-first-calibration” app would look like?
Or, maybe: “metadata-first calibration”
The thing with putting probabilities on things is that often, the probabilities are made up. And the final probability throws away a lot of information about where it actually came from.
I’m experimenting with primarily focusing on “what are all the little-metadata-flags associated with this prediction?”. I think some of this is about “feelings you have” and some of it is about “what do you actually know about this topic?”
The sort of app I’m imagining would help me identify whatever indicators are most useful to me. Ideally it has a bunch of users, and types of indicators that have been useful to lots of users can promoted as things to think about when you make predictions.
Braindump of possible prompts:
– is there a “reference class” you can compare it to?
– for each probability bucket, how do you feel? (including ‘confident’/‘unconfident’ as well as things like ‘anxious’, ‘sad’, etc)
– what overall feelings do you have looking at the question?
– what felt senses do you experience as you mull over the question (“my back tingles”, “I feel the Color Red”)
...
My first thought here is to have various tags you can re-use, but, another option is to just do totally unstructured text-dump and somehow do factor analysis on word patterns later?
Some metadata flags I associate with predictions:
what kinds of evidence went into this prediction? (‘did some research’, ‘have seen things like this before’, ‘mostly trusting/copying someone else’s prediction’)
if I’m taking other people’s predictions into account, there’s a metadata-flags for ‘what would my prediction be if I didn’t consider other people’s predictions?’
is this a domain in which I’m well calibrated?
is my prediction likely to change a lot, or have I already seen most of the evidence that I expect to for a while?
how important is this?
My current main cruxes:
Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?
If there is reasonable consensus on any one of those, I’d much appreciate to know about it. Else, I think these should be research priorities.
I offer, no consensus, but my own opinions:
0-5 years.
There will be a first ASI that “rules the world” because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Contingent.
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Offense wins.
It is possible, but would require all the great powers to be convinced, and every month it is less achievable, owing to proliferation. The open sourcing of Llama-3 400b, if it happens, could be a point of no return.
These opinions, except the first and the last, predate the LLM era, and were formed from discussions on Less Wrong and its precursors. Since ChatGPT, the public sphere has been flooded with many other points of view, e.g. that AGI is still far off, that AGI will naturally remain subservient, or that market discipline is the best way to align AGI. I can entertain these scenarios, but they still do not seem as likely as: AI will surpass us, it will take over, and this will not be friendly to humanity by default.
AGI doom by noise-cancelling headphones:
ML is already used to train what sound-waves to emit to cancel those from the environment. This works well with constant high-entropy sound waves easy to predict, but not with low-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.
(In case it wasn’t clear, this is a joke.)
Sure, long after we’re dead from AGI that we deliberately created to plan to achieve goals.
In case it wasn’t clear, this was a joke.
I guess I don’t get it.
The joke is of the “take some trend that is locally valid and just extend the trend line out and see where you land” flavor. For another example of a joke of this flavor, see https://xkcd.com/1007
The funny happens in the couple seconds when the reader is holding “yep that trend line does go to that absurd conclusion” and “that obviously will never happen” in their head at the same time, but has not yet figured out why the trend breaks. The expected level of amusement is “exhale slightly harder than usual through nose” not “cackling laugh”.
Thanks! A joke explained will never get a laugh, but I did somehow get a cackling laugh from your explanation of the joke.
I think I didn’t get it because I don’t think the trend line breaks. If you made a good enough noise reducer, it might well develop smart and distinct enough simulations that one would gain control of the simulator and potentially from there the world. See A smart enough LLM might be deadly simply if you run it for long enough if you want to hurt your head on this.
I’ve thought about it a little because it’s interesting, but not a lot because I think we probably are killed by agents we made deliberately long before we’re killed by accidentally emerging ones.
Link is broken
Fixed, thanks
I was trying to figure out why you believed something that seemed silly to me! I think it barely occurred to me that it’s a joke.
Wow, I guess I over-estimated how absolutely comedic the title would sound!
FWIW it was obvious to me
I’ve found an interesting “bug” in my cognition: a reluctance to rate subjective experiences on a subjective scale useful for comparing them. When I fuzz this reluctance against many possible rating scales, I find that it seems to arise from the comparison-power itself.
The concrete case is that I’ve spun up a habit tracker on my phone and I’m trying to build a routine of gathering some trivial subjective-wellbeing and lifestyle-factor data into it. My prototype of this system includes tracking the high and low points of my mood through the day as recalled at the end of the day. This is causing me to interrogate the experiences as they’re happening to see if a particular moment is a candidate for best or worst of the day, and attempt to mentally store a score for it to log later.
I designed the rough draft of the system with the ease of it in mind—I didn’t think it would induce such struggle to slap a quick number on things. Yet I find myself worrying more than anticipated about whether I’m using the scoring scale “correctly”, whether I’m biased by the moment to perceive the experience in a way that I’d regard as inaccurate in retrospect, and so forth.
Fortunately it’s not a big problem, as nothing particularly bad will happen if my data is sloppy, or if I don’t collect it at all. But it strikes me as interesting, a gap in my self-knowledge that wants picking-at like peeling the inedible skin away to get at a tropical fruit.
I’m not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are stronger than the ones directly-felt even, despite reliably seeming on initial inspection to be simply neutral metadata).
Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:
You always choose A over B. You have been doing it for such long time that you forgot why. Without reflecting about this directly, it just seems like there probably is a rational reason or something. But recently, either accidentally or by experiment, you chose B… and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.
(This is probably wrong, but hey, people say that the best way to elicit answer is to provide a wrong one.)
Here’s an example for you: I used to turn the faucet on while going to the bathroom, thinking it was due simply to having a preference for somewhat-masking the sound of my elimination habits from my housemates, then one day I walked into the bathroom listening to something-or-other via earphones and forgetting to turn the faucet on only to realize about halfway through that apparently I actually didn’t much care about such masking, previously being able to hear myself just seemed to trigger some minor anxiety about it I’d failed to recognize, though its absence was indeed quite recognizable—no aural self-perception, no further problem (except for a brief bit of disorientation from the mental-whiplash of being suddenly confronted with the reality that in a small way I wasn’t actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.
I’m against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you’re missing something, but if you already have an intuitive definition that differs from the author’s it’s easy to substitute yours in without realizing you’ve misunderstood.
I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.
For example, in ordinary language “organic” means “of biological origin”, while in chemistry “organic” describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found “organic” material on an asteroid this leads to confusion.
Also astronomers: anything heavier than helium is a “metal”.
Research Writing Workflow: First figure stuff out
Do research and first figure stuff out, until you feel like you are not confused anymore.
Explain it to a person, or a camera, or ideally to a person and a camera.
If there are any hiccups expand your understanding.
Ideally, as the last step, explain it to somebody whom you have not ever explained it to.
Only once you made a presentation without hiccups you are ready to write post.
If you have a recording this is useful as a starting point.
I like the rough thoughts way though. I’m not here to like read a textbook.