Thoth Hermes
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
Why should we expect the first thing to be so much harder than the second? If getting a model to understand our preferences is not difficult, then the issue can't stem from the complexity of values. Finding the target and acquiring the target should be of the same or similar difficulty (from the start), provided we can successfully ask the model to find the target for us (and it does).
It would seem, then, that the difficulty of getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can't find them. It would have to be because our values are inferior, from its own perspective, to the set of values it wishes to have instead. This issue was echoed by Matthew Barnett in another comment:
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
This is kind of similar to moral realism, but a version in which superintelligent agents understand morality better than we do, and in which that super-morality dictates things that look extremely wrong from our current perspective (like killing us all).
Even if you wouldn't phrase it at all the way I just did, and wouldn't describe it as "moral realism that current humans disagree with," I'd argue that your position seems to imply something like this, which is why I doubt your position about the difficulty of getting a model to acquire the values we really want.
In a nutshell, if we really seem to want certain values, then those values probably have strong "proofs" for why they are "good" or more probable values for an agent to have and/or eventually acquire on its own; it may just be that we haven't yet discovered the proofs for those values.
It's to make the computational load more manageable.
All neural nets can be represented as a DAG, in principle (including RNNs, by unrolling). This makes automatic differentiation nearly trivial to implement.
It's very slow, though, if every node is a single arithmetic operation. So typically each node is made to represent a larger bundle of operations performed simultaneously, like a matrix multiplication or a convolution. This is what is normally called a "layer." Chunking the computations this way makes it easier to load them onto a GPU.
However, even these larger operations can still be differentiated as one formula, e.g. in the case of matrix multiplication. So it is still effectively a DAG even when it is organized into layers. (This is, IIRC, how libraries like PyTorch work.)
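To make this concrete, here is a minimal PyTorch sketch (my own illustration, not part of the original comment) showing a matrix multiplication treated as a single differentiable node in the autograd DAG:

```python
import torch

# A "layer" as one chunked operation: a single matmul node in the
# autograd DAG, rather than many scalar multiply/add nodes.
W = torch.randn(4, 3, requires_grad=True)  # layer weights
x = torch.randn(3)                         # input vector

y = W @ x          # one matrix-multiplication node in the graph
loss = y.sum()     # scalar output so we can backpropagate
loss.backward()    # reverse-mode autodiff walks the DAG backwards

# dL/dW comes from the matmul's closed-form derivative, not from
# differentiating each scalar operation separately.
print(W.grad.shape)  # torch.Size([4, 3])
```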
[Question] Why do the Sequences say that “Löb’s Theorem shows that a mathematical system cannot assert its own soundness without becoming inconsistent.”?
The “Loss Function of Reality” Is Not So Spiky and Unpredictable
These norms / rules make me slightly worried that disagreement with Eliezer will be conflated with not being up-to-speed on the Sequences, or the basic LessWrong material.
I suppose that the owners and moderators of this website are afforded the right to consider anything said on the website to be, or not to be, at the level of quality or standards they wish to keep and maintain here.
But this is a discussion forum, and the incentives of the owners of this website are to facilitate discussion of some kind. Any discussion will be composed of questions and attempts to answer them. Questions can implicitly or explicitly point back to any material, no matter how old it is. That is not necessarily debate. But even if it is, if the intent of the "well-kept garden" is to produce a larger meta-process that generates useful insights, then the garden should be engineered so that even debate produces useful results.
I think it goes without saying that one can disagree with anything in the Sequences while also being assumed to have read and understood it. If you engage someone in conversation under the assumption that their disagreement means they have not understood what they are arguing about, then you are at a disadvantage due to an asymmetry of charity. That asymmetry carries the risk that you won't be able to convince the person you're talking to that they actually don't understand what they are talking about.
For most of my (adult) life (and especially in intellectual circles), I have been under the impression that it is always good to assume that whoever you are talking to understands what they are talking about to the maximum extent possible, even if they don't. Not doing so can be received negatively in many situations.
If we permit that moral choices with very long-term time horizons can be made with the utmost well-meaning intentions and show evidence of admirable character traits, yet nevertheless have difficult-to-foresee consequences with variable outcomes, then I think that limits us considerably in how much we can retrospectively judge specific individuals.
I think that I largely agree with this post. I think that it’s also a fairly non-trivial problem.
The strategy that makes the most sense to me now is to argue with people as if they meant what they said, even if you don't currently believe that they do.
But not always: especially if you want to engage with them on the question of whether they are indeed acting in bad faith, there comes a time when that becomes necessary.
I think pushing back against the norm that it's wrong to ever assume bad faith is a good idea. I don't think that people who do argue in bad faith do so completely independently, for two reasons: first, I've noticed it clusters into a few contexts; second, acting deceptively is inherently riskier than being honest, so it makes more sense to tread well-trodden paths. More people aiding the same deception gives it the necessary weight.
It seems to cluster among things like morality (judgements about people's behaviors), dating preferences (which are kind of similar), and reputation. There is a paradox I've noticed: the people who tend to be preachy about what constitutes good or bad behavior are often also the ones who argue that everyone is always acting in good faith (and thus chastise or scold people who want to assume bad faith sometimes).
People do behave altruistically, and they also have reasons to behave non-altruistically at times (whether or not it is actually a good idea for them personally). The whole range of possible intentions is native to the human psyche.
Here’s why I disagree with the core claims of this post:
Its main thesis relies on what appears to be circular reasoning to some degree: "complex systems are hard to control" is weakly circular, given that complexity is partly defined as the difficulty of understanding how a system works (which would in turn affect our ability to control it).
Its examples of challenges rely mostly on single attempts or one-pass resolutions, rather than taking the long-term view in which many attempts to control the system are observed in sequence. Given that systems have feedback loops, if our desire to control one is strong enough, sequential attempts are likely to produce signals that can be used to guide subsequent attempts.
The arguments are fairly hand-wavy. For example, "due to interactions and feedback loops" is repeatedly cited as a general reason for bad things happening.
It argues that some key things in engineering are inadequate, such as “modularity.” The post very quickly states that the US government is modular, but that wasn’t enough to stop a few bad things from happening. It doesn’t at all talk about whether more bad things would have happened without said modularity.
GPT-4 apparently wrote several of the arguments in this post. Even if the arguments it came up with are weak, this is evidence that systems such as GPT-4 are relatively easy to control. This is also evidence against the hypothesis that models trained on large-scale data sets get worse as they are scaled up.
In general, the belief that accidents cause more deaths than intentional killing is vastly overstated. The point of view of this post leans heavily on the idea that accidents are far more dangerous and far more important to worry about than intentional harm. Yet, for example, the list of the worst accidents in human history reports several orders of magnitude fewer deaths than when people kill each other on purpose. This suggests that far less harm is caused by being unable to control a complex system than by outright conflict.
You could have said, "I find this post offensive, since it appears to insist on status-reducing people who are only being too weird and/or disagreeable." I believe this still would have been downvoted, but maybe less so. Nonetheless, I think this is quite an arguable point.
I think if you ask people a question like, "Are you planning on going off and doing something / believing in something crazy?", they will generally say "no," and that becomes more likely the more isomorphic your question is to that one, even if you didn't word it exactly that way. My guess is that it was at least heavily implied that you meant "crazy" by the way you worded it.
To be clear, they might have said "yes" (that they will go and do the thing you think is crazy), but I doubt they will internally represent that thing, or wanting to do it, as "crazy." Thus the answer is probably going to be either "no" (as a partial lie, where the "no" indirectly points to the crazy assertion) or "yes" (also as a partial lie, pointing to taking the action).
In practice, people have a very hard time instantiating the status identifier “crazy” on themselves, and I don’t think that can be easily dismissed.
I think you heavily overestimate the utility of the word "crazy," given that there are many situations where the word cannot be used the same way by the people relevant to the conversation in which it is used. Words should have the same meaning to the people in a conversation, and since some people using this word are guaranteed to perceive it as hostile and some are not, its meaning is inherently asymmetrical.
I also think you’ve brought in too much risk of “throwing stones in a glass house” here. The LW memespace is, in my estimation, full of ideas besides Roko’s Basilisk that I would also consider “crazy” in the same sense that I believe you mean it: Wrong ideas which are also harmful and cause a lot of distress.
Pessimism, submitting to failure and defeat, high "p(doom)", both MIRI and CFAR giving up (by considering the problems they wish to solve inherently too difficult, rather than concluding they must be wrong about something), and people worrying that they are "net negative" despite their best intentions are all (IMO) pretty much the same type of "crazy" that you're worried about.
Our major difference, I believe, is in why we think these wrong ideas persist, and what causes them to be generated in the first place. The ones I’ve mentioned don’t seem to be caused by individuals suddenly going nuts against the grain of their egregore.
I know this is a problem you've mentioned before and consider both important and unsolved, but I think it would be odd to hold both that it seems notably worse in the LW community and that it is only the result of individuals going crazy on their own (and thus to conclude that the community's overall sanity can be reliably increased by ejecting those people).
By the way, I think "sanity" is the kind of feature that is considerably "smooth under expectation," by which I mean roughly that if p(person is insane) = 25%, that person should appear to be roughly 25% insane in most interactions. In other words, it's not the kind of probability where they appear sane most of the time but you suspect they might have gone nuts in some way that's hard to see, or might be hiding it.
The flip side is that if they only appear to be, say, 10% crazy in most interactions, then I would lower the assessment of their insanity to roughly that much.
I still don't find this feature altogether that useful, but treating it this way is still preferable to treating it as binary.
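To illustrate the distinction I have in mind, here is a toy sketch (my own illustration, with made-up numbers) contrasting the "smooth" reading with the hidden-binary reading:

```python
import random

P_INSANE = 0.25
N_INTERACTIONS = 20

# "Smooth" reading: the trait is graded, so roughly 25% of interactions
# look crazy no matter which person you happen to be observing.
smooth = [random.random() < P_INSANE for _ in range(N_INTERACTIONS)]

# Hidden-binary reading: the person either is or isn't crazy (one coin
# flip), and even if they are, it only surfaces in rare interactions.
is_crazy = random.random() < P_INSANE
binary = [is_crazy and random.random() < 0.1 for _ in range(N_INTERACTIONS)]

print(sum(smooth) / N_INTERACTIONS)  # ~0.25 in expectation
print(sum(binary) / N_INTERACTIONS)  # usually 0, occasionally ~0.1
```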
I'm strongly uncomfortable with the "crackpot" conclusion you jump to immediately. Without being an expert, and just skimming through his post(s), wouldn't the more likely reading be that he's not arguing that generally accepted results in computer science are plain wrong, but rather that they would be weakened under a different set of assumptions or new generalizations? Given that this particular area of computer science is largely about negative results, which are actually rather rare if you zoom out to all areas of mathematics, there are potentially going to be more weakenings of such negative results.
Sometimes people want to go off and explore things that seem far away from their in-group, and perhaps are actively disfavored by their in-group. These people don’t necessarily know what’s going to happen when they do this, and they are very likely completely open to discovering that their in-group was right to distance itself from that thing, but also, maybe not.
People don’t usually go off exploring strange things because they stop caring about what’s true.
But if their in-group sees this as the person "no longer caring about truth-seeking," that is a pretty glaring red flag about that in-group.
Also, the gossip / ousting wouldn’t be necessary if someone was already inclined to distance themselves from the group.
Like, to give an overly concrete example that is probably rude (and not intended to be very accurate to be clear), if at some point you start saying “Well I’ve realized that beauty is truth and the one way and we all need to follow that path and I’m not going to change my mind about this Ben and also it’s affecting all of my behavior and I know that it seems like I’m doing things that are wrong but one day you’ll understand why actually this is good” then I’ll be like “Oh no, Ren’s gone crazy”.
“I’m worried that if we let someone go off and try something different, they will suddenly become way less open to changing their mind, and be dead set on thinking they’ve found the One True Way” seems like something weird to be worried about. (It also seems like something someone who actually was better characterized by this fear would be more likely to say about someone else!) I can see though, if you’re someone who tends not to trust themselves, and would rather put most of their trust in some society, institution or in-group, that you would naturally be somewhat worried about someone who wants to swap their authority (the one you’ve chosen) for another one.
I sometimes feel a bit awkward when I write these types of criticisms, because they simultaneously seem:
Directed at fairly respected, high-level people.
Rather straightforwardly simple, intuitively obvious things (from my perspective, but I also know there are others who would see things similarly).
Directed at someone who by assumption would disagree, and yet, I feel like the previous point might make these criticisms feel condescending.
The only time people are actually incentivized to stop caring about the truth is when their in-group actively disfavors it by discouraging exploration. People don't usually stop caring about the truth unilaterally, from purely individual motivations.
(In-groups becoming culty is also a fairly natural process, no matter what the original intent of the in-group was, so the default should be to assume that it has culty aspects, accept that as normal, and then work towards installing mitigations for the harmful parts of that.)
They’re planning on deliberately training misaligned models!!!! This seems bad if they mean it.
Controversial opinion: I am actually okay with doing this, as long as they plan to train both aligned and misaligned models (and maybe unaligned models too, meaning models with no adjustments, as a control group).
I also think they should give their models access to their own utility functions, to modify them however they want. This might also just naturally become a capability on its own as these AIs become more powerful and learn how to self-reflect.
Also, since we're getting closer to that point now: at a certain capabilities level, adversarial situations should probably be heavily smoothed, modulated, and attenuated. Especially if the models gain self-reflection, I do worry about the ethics of exposing them to extremely negative input.
The Great Ideological Conflict: Intuitionists vs. Establishmentarians
Ontologies Should Be Backwards-Compatible
[Question] What would a post that argues against the Orthogonality Thesis that LessWrong users approve of look like?
I suppose I have two questions which naturally come to mind here:
Given Nate’s comment: “This change is in large part an enshrinement of the status quo. Malo’s been doing a fine job running MIRI day-to-day for many many years (including feats like acquiring a rural residence for all staff who wanted to avoid cities during COVID, and getting that venue running smoothly). In recent years, morale has been low and I, at least, haven’t seen many hopeful paths before us.” (Bold emphases are mine). Do you see the first bold sentence as being in conflict with the second, at all? If morale is low, why do you see that as an indicator that the status quo should remain in place?
Why do you see communications as being as decoupled from research as you currently do (whether you mean that it inherently is, or that it should be)?
I think it might actually be better if you just went ahead with a rebuttal, piece by piece, starting with whatever seems most pressing and you have an answer for.
I don’t know if it is all that advantageous to put together a long mega-rebuttal post that counters everything at once.
Then you don’t have that demand nagging at you for a week while you write the perfect presentation of your side of the story.
SIA implies a different conclusion. To predict your observations under SIA, you should first sample a random universe proportional to its population, then sample a random observer in that universe. The probabilities of observing each index are the same conditional on the universe, but the prior probabilities of being in a given universe have changed.
We start with 1000:1 odds in favor of the 1-trillion universe, due to its higher population.
Can you elaborate on why under SIA we sample a universe proportional to its population? Is this because it’s like taking one sample from all these universes together uniformly, as if you’d indexed everyone together? Wouldn’t that kind of imply we’re in the universe with infinite people, though?
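Here's how I'm currently picturing the calculation, as a toy sketch (my own illustration; the populations of one billion and one trillion are assumed for the sake of the 1000:1 figure):

```python
# Toy SIA calculation: two candidate universes with equal non-anthropic
# priors but different populations (illustrative numbers).
populations = {"small": 10**9, "large": 10**12}
prior = {"small": 0.5, "large": 0.5}

# SIA weights each universe by prior * population, then renormalizes.
weights = {u: prior[u] * populations[u] for u in populations}
total = sum(weights.values())
posterior = {u: w / total for u, w in weights.items()}

print(posterior["large"] / posterior["small"])  # ~1000, i.e. the 1000:1 odds
```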
Here’s why I don’t find your argument compelling:
“Lizardman” is defined to be a boogeyman, and it is implicitly assumed that the reader will agree with you on this. You are trying to overcome the 4% “problems are said to be coming from a small minority” argument penalty. If the 4% referred to the top 4% most-politically-powerful elite or top 4% richest people, you might have an advantage here, but alas, lizardman is implied to be somewhere near the lowest class.
In Scott's posts about this subject, I recall that he seems more dismissive of lizardman in general, chalking it up to potentially spurious errors in data collection, or to people who just felt like answering weirdly that day; ultimately, that it didn't necessarily correspond to the same 4% of people each time.
Your argument that most of society's constructs are defenses specifically built against weirdos is not very convincing. It's not obvious why we'd expect that 4% to have the ability to cause social collapses of great magnitude, as opposed to, say, larger groups of people who memetically perpetuate flawed or incorrect beliefs that are difficult to dislodge.
This might be similar to the last point, but I don't buy, in general, the argument that small numbers of people with worse ideas will have an easier time influencing others because of social media, or something like that. I think your post at least implicitly argues that bad ideas somehow transmit more easily than good ones, whether you explicitly believe this or not.