It does not seem obviously hopeless to monitor Thinkish or even Neuralese. If a model uses Thinkish in its chain of thought, then that dialect of Thinkish means something to it. Perhaps the model can be prompted to translate Thinkish appearing in another instance’s chain of thought. Or perhaps a model could be fine-tuned to understand how a given model’s Thinkish chain of thought affects its output, and (though I’m not sure how to train for this last step) explain how it does so in a way that humans can follow. These techniques could also be tried on apparently natural-language chains of thought whose hidden meaning the model uses but which isn’t immediately apparent to humans. And since Neuralese differs from Thinkish only in that it doesn’t re-use natural language’s token space, perhaps similar techniques could be used to translate a model’s Neuralese.
AlexMennen
Does this fundraiser have a deadline?
I see that the info hovertext over the amount raised on the every.org page says that some of it was raised offline, and only lists matching funds for the remainder that wasn’t raised offline. Does this mean that funds raised offline don’t get matched, that their matches from SFF were included in the “raised offline” figure, or that their matches from SFF aren’t counted in the total amount raised displayed on that page?
This post claims that Anthropic is embarrassingly far behind Twitter AI psychologists at skills that are possibly critical to Anthropic’s mission. This suggests to me that Anthropic should be trying to recruit from the Twitter AI psychologist circle.
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse.
I think this depends somewhat on the threat model. How scared are you of the character instantiated by the model vs the language model itself? If you’re primarily scared that the character would misbehave, and not worried about the language model misbehaving except insofar as it reifies a malign character, then maybe making the training data give the model no reason to expect such a character to be malign would reduce the risk to negligible, and that sure would be easier if no one had ever thought of the idea that powerful AI could be dangerous. But suppose you’re also worried about the language model itself misbehaving, independently of whether it predicts that its assigned character would misbehave (for instance, the classic example of turning the world into computronium that it can use to better predict the behavior of the character). That doesn’t seem feasible to solve without talking about it, so the decrease in risk of model misbehavior from publicly discussing AI risk is probably worth the accompanying increase in risk of the character misbehaving (which is probably easier to solve anyway).
I don’t understand outer vs inner alignment especially well, but I think this at least roughly tracks that distinction. If a model does a great job of instantiating a character like we told it to, and that character kills us, then the goal we gave it was catastrophic, and we failed at outer alignment. If the model, in the process of being trained on how to instantiate the character, also kills us for reasons other than that it predicts the character would do so, then the process we set up for achieving the given goal also ended up optimizing for something else undesirable, and we failed at inner alignment.
Yes, it is useful for mental machinery that evolved to enable cooperation and conflict resolution to have features like the ones you describe. I don’t agree that this points towards there being an underlying reality.
You can believe that what you do or did was unethical, which doesn’t need to have anything to do with conflict resolution.
It does relate to conflict resolution. Being motivated by ethics is useful for avoiding conflict, so it’s useful for people to be able to evaluate the ethics of their own hypothetical actions. But there are lots of considerations for people to take into account when choosing actions, so this does not mean that someone will never take actions that they concluded had the drawback of being unethical. Being able to reason about the ethics of actions you’ve already taken is additionally useful insofar as it correlates with how others are likely to see them, which can inform whether it is a good idea to hide information about your actions, be ready to try to make amends, defend yourself from retribution, etc.
Beliefs are not perceptions.
If there is some objective moral truth that common moral intuitions are heavily correlated with, there must be some mechanism by which they ended up correlated. Your reply to Karl makes it sound like you deny that anyone ever perceives anything other than perception itself, which isn’t how anyone else uses the word “perceive”.
It doesn’t mean that we are necessarily or fully motivated to be ethical.
Yes, but if no one was at all motivated by ethics, then ethical reasoning would not be useful for people to engage in, and no one would. The fact that ethics is a powerful force in society is central to why people bother studying it. This does not imply that everyone is motivated by ethics, or that anyone is fully motivated by ethics.
Regardless of whether the view Eliezer espouses here really counts as moral realism, as people have been arguing about, it does seem that it would claim that there is a fact of the matter about whether a given AI is a moral patient. So I appreciate your point regarding the implications for the LW Overton window. But for what it’s worth, I don’t think Eliezer succeeds at this, in the sense that I don’t think he makes a good case for it to be useful to talk about ethical questions that we don’t have firm views on as if they were factual questions, because:
1. Not everyone is familiar with the way Eliezer proposes to ground moral language, not everyone who is familiar with it will be aware that it is what any given person means when they use moral language, and some people who are aware that a given person uses moral language the way Eliezer proposes will object to them doing so. Thus using moral language in the way Eliezer proposes, whenever it’s doing any meaningful work, invites getting sidetracked on unproductive semantic discussions. (This is a pretty general-purpose objection to normative moral theories)
2. Eliezer’s characterization of the meaning of moral language relies on some assumptions about it being possible in theory for a human to eventually acquire all the relevant facts about any given moral question and form a coherent stance on it, and the stance that they eventually arrive at being robust to variations in the process by which they arrived at it. I think these assumptions are highly questionable, and shouldn’t be allowed to escape questioning by remaining implicit.
3. It offers no meaningful action guidance beyond “just think about it more”, which is reasonable, but a moral non-realist who aspires to acquire moral intuitions on a given topic would also think of that.
One could object to this line of criticism on the grounds that we should talk about what’s true independently of how it is useful to use words. But any attempt to appeal to objective truth about moral language runs into the fact that words mean what people use them to mean, and you can’t force people to use words the way you’d like them to. It looks like Eliezer kind of tries to address this by observing that extrapolated volition shares some features in common with the way people use moral language, which is true, and seems to conclude that it is the way people use moral language even if they don’t know it, which does not follow.
I agree that LessWrong comments are unlikely to resolve disagreements about moral realism. Much has been written on this topic, and I doubt I have anything new to say about it, which is why I didn’t think it would be useful to try to defend moral anti-realism in the post. I brought it up anyway because the argument in that paragraph crucially relies on moral anti-realism, I suspect many readers reject moral realism without having thought through the implications of that for AI moral patienthood, and I don’t in fact have much uncertainty about moral realism.
Regarding LessWrong consensus on this topic, I looked through a couple LessWrong surveys, and didn’t find any questions about this, so, this doesn’t prove much, but just out of curiosity, I asked Claude 4 Sonnet to predict the results of such a question, and here’s what it said (which seems like a reasonable guess to me):
**Accept moral realism**: ~8%
**Lean towards moral realism**: ~12%
**Not sure**: ~15%
**Lean against moral realism**: ~25%
**Reject moral realism**: ~40%
If our experience of qualia reflects some poorly understood phenomenon in physics, it could be part of a cluster of related phenomena, not all of which manifest in human cognition. We don’t have as precise an understanding of qualia as we do of electrons; we just try to gesture at it, and we mostly figure out what each other is talking about. If some related phenomenon manifests in computers when they run large language models, which has some things in common with what we know as qualia but also some stark differences from any such phenomenon manifesting in human brains, the things we have said about what we mean when we say “qualia” might not be sufficient to determine whether said phenomenon counts as qualia or not.
It undercuts the motivation for believing in moral realism, leaving us with no evidence for objective moral facts, which are a complicated sort of thing, and thus unlikely to exist without evidence.
I tried to address this sort of response in the original post. All of these more precise consciousness-related concepts share the commonality that they were developed using our perception of our own cognition and seeing evidence that related phenomena occur in other humans. So they are all brittle in the same way when trying to extrapolate and apply them to alien minds. I don’t think that qualia is on significantly firmer epistemic ground than consciousness is.
This is correct, but I don’t think what I was trying to express relies on Camp 1 assumptions, even though I expressed it with a Camp 1 framing. If cognition is associated with some nonphysical phenomenon, then our consciousness-related concepts are still tailored to how this phenomenon manifests specifically in humans. There could be some related metaphysical phenomenon going on in large language models, and no objective fact as to whether “consciousness” is an appropriate word to describe it.
Human moral judgements seem easily explained as an evolutionary adaptation for cooperation and conflict resolution, and very poorly explained by perception of objective facts. Even if such facts did exist, that would not by itself give humans any mechanism for perceiving them or any reason to be motivated by them.
Against asking if AIs are conscious
Contemporary AI existential risk concerns originated prior to it being obvious that a dangerous AI would likely involve deep learning, so no one could claim that the arguments that existed in ~2010 involved technical details of deep learning, and you didn’t need to find anything written in the 19th century to establish this.
PSA: Before May 21 is a good time to sign up for cryonics
I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
I wonder if it might be more effective to fund legal action against OpenAI than to compensate individual ex-employees for refusing to sign an NDA. Trying to take vested equity away from ex-employees who refuse to sign an NDA sounds likely to not hold up in court, and if we can establish a legal precedent that OpenAI cannot do this, that might make other ex-employees much more comfortable speaking out against OpenAI than the possibility that third parties might fundraise to partially compensate them for lost equity would be (a possibility you might not even be able to make every ex-employee aware of). The fact that this would avoid financially rewarding OpenAI for bad behavior is also a plus. Of course, legal action is expensive, but so is the value of the equity that former OpenAI employees have on the line.
Yeah, sorry that was unclear; there’s no need for any form of hypercomputation to get an enumeration of the axioms of U. But you need a halting oracle to distinguish between the axioms and non-axioms. If you don’t care about distinguishing axioms from non-axioms, but you do want an assignment of truth values to the atomic formulas Q(i,j) that’s consistent with the axioms of U, then that is applying a consistent guessing oracle to U.
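To spell out the distinction being drawn (with notation of my own, beyond the U and Q(i,j) of the thread): a halting oracle decides membership exactly, while a consistent guessing oracle only has to produce some consistent assignment, without ever deciding what the axioms are.

```latex
% Sketch, not from the original thread; H and v are my notation.
A halting oracle computes
\[
H(e) = \begin{cases} 1 & \text{if program } e \text{ halts} \\ 0 & \text{otherwise,} \end{cases}
\]
which suffices to decide whether any given sentence is an axiom of $U$.
A consistent guessing oracle for $U$ instead outputs any map $v$ from atomic
formulas to $\{0,1\}$ such that
\[
U \;\cup\; \{\, Q(i,j) : v(Q(i,j)) = 1 \,\} \;\cup\; \{\, \lnot Q(i,j) : v(Q(i,j)) = 0 \,\}
\]
is consistent. Since $v$ need not identify the axioms of $U$, such an oracle
can be strictly weaker than a halting oracle.
```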
I see that when I commented yesterday, I was confused about how you had defined U. You’re right that you don’t need a consistent guessing oracle to get from U to a completion of U, since the axioms are all atomic propositions, and you can just set the remaining atomic propositions however you want. However, this introduces the problem that getting the axioms of U requires a halting oracle, not just a consistent guessing oracle, since to tell whether something is an axiom, you need to know whether there actually is a proof of a given thing in T.
I think this visual effect could plausibly be explained by polarization, without there being any real correlation between extremeness and concern about AI x-risk. Most politicians aren’t moderate, and most politicians aren’t concerned about AI x-risk. So the distribution of ideology scores of politicians at the bottom (not concerned about AI x-risk) is bimodal, the distribution of ideology scores of politicians near the top (very concerned about AI x-risk) is bimodal, and the whole distribution is thicker at the bottom than near the top. The density of non-x-risk-concerned moderates could be high enough to come close to saturating our ability to perceive the density of dots in this graphic, so that the actually much denser regions leftward and rightward aren’t readily apparent to be much denser. But higher up, the dots aren’t dense enough to saturate our ability to perceive their density, so it is visually obvious that there are more at the extremes than in the middle.
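The argument can be checked with a toy simulation (entirely hypothetical numbers, not the post’s actual data): draw ideology scores from the same bimodal mixture for both rows, with many more points in the bottom row, and compare bin counts. The extremes beat the middle in both rows, even though a scatter plot of the dense bottom row would look saturated everywhere.

```python
import random

random.seed(0)

def ideology_sample(n):
    """Bimodal mixture: each politician is near -1 or +1, with some spread."""
    return [random.gauss(random.choice([-1.0, 1.0]), 0.3) for _ in range(n)]

# Hypothetical counts: many non-concerned politicians (bottom of the graphic),
# few x-risk-concerned ones (top). Same ideology distribution in both rows.
bottom = ideology_sample(5000)
top = ideology_sample(200)

def count_in(xs, lo, hi):
    """Number of scores falling in the interval [lo, hi]."""
    return sum(lo <= x <= hi for x in xs)

# Moderate bin vs. one of the extreme bins, for each row.
middle_bottom = count_in(bottom, -0.25, 0.25)
extreme_bottom = count_in(bottom, 0.75, 1.25)
middle_top = count_in(top, -0.25, 0.25)
extreme_top = count_in(top, 0.75, 1.25)

print("bottom row: middle =", middle_bottom, " extreme =", extreme_bottom)
print("top row:    middle =", middle_top, " extreme =", extreme_top)
# In the bottom row, even the relatively sparse middle bin has enough dots to
# saturate a scatter plot, hiding the fact that the extremes are far denser;
# in the top row, the counts are small enough that the gap is visible.
```

The point is just that “middle looks as dense as extremes” at the bottom is compatible with the middle actually being much sparser there too.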