I believe that in Thinking, Fast and Slow, Kahneman refers to this fallacy as “What You See Is All There Is” (WYSIATI). And it used to be common for people to talk about “Unknown Unknowns” (things you don’t know, that you also don’t know you don’t know).
Ebenezer Dukakis
To me, if someone posts an intense-feeling negatively worded text in response to what other people are doing, it usually signals that there is something they care about that they perceive to be threatened. I’ve found it productive to try to relate to that first, before responding. Jumping to enforcing general rules stipulated somewhere in the community, and then implying that the person not following those rules is not harmonious with or does not belong to the community, can get counterproductive.
I’m a bit concerned about a situation where “insiders” always get this sort of contextual benefit-of-the-doubt, and “outsiders” don’t.
Just to clarify here, I have no issue with you thinking the post is bad. That seems beside the point to me. My issue is with you doing much of what you accuse Miller of doing.
Insults: “This post seems completely insane to me, as do people who unquestionable retweet it.”
Aggression: “I cannot believe I have to argue for this… [cursing]...”
Sneering: “Has anyone who liked this actually read this post? How on earth is this convincing to anyone?”
Note also that the discussion around the Bentham post was previously calm and friendly. You walked in and dramatically worsened the discourse quality. By contrast, Geoffrey engages on hot-button political topics where discussion is already very heated.
As a quick and relatively objective measure, with a quick search, out of all 80K Geoffrey Miller tweets, I was only able to find one non-quoted f-bomb (“Fuck the Singularity.”).
Your tweets appear to have a somewhat larger number of them, and they’re often directed at individuals rather than abstract concepts. “Fuck them”, “fuck you”, “fuck [those people]”.
As a matter of simple intellectual honesty, it would be nice if you could acknowledge that you engage in insults and aggressive behavior on Twitter. You might be doing it less than Geoffrey does. You might express it in a different way. But it’s just a question of degree, as far as I can tell. I really don’t think you have much moral high ground here.
You also have far fewer tweets than Geoffrey does (factor of ~16 difference). So it’s not just that you’ve dropped more f-bombs than him; your density of f-bombs appears to be far higher.
Keep in mind that US conservatives are liable to be reading this thread, trying to determine whether they want to ally with a group such as yourselves. Conservatives have much more leverage to dictate alliance terms than you do. Note the alliance with the AI art people was apparently already wrecked. Something you might ask yourselves: If you can’t make nice with a guy like me, who shares more of your ideals than either artists or US conservatives do, how do you expect to make nice with US conservatives?
It’s not a norm of discourse that one cannot state that a position is absurd.
Speaking as someone who makes very little effort to avoid honey consumption, my opinion of Habryka would have dropped much less if he’d said something like: “Sorry, this position is just intuitively absurd to me, and I’m happy to reject it on that basis.” So I don’t think the issue has to do with absurdity per se.
I said I thought he violated “what I’d consider reasonable norms of discourse”. You can see Ben West thought something similar.
I’d estimate that Habryka violated roughly 7 or 8 of the Hacker News commenting guidelines in that discussion.
Your ideas about reasonable discourse can be different from mine, and Ben West’s, and Hacker News’. That’s OK. I was just sharing my opinion.
It’s been a while since I read that discussion. I remember my estimation of Habryka dropped dramatically when I read it. Maybe I can try to reconstruct why in more detail if you want. But contrasting what Habryka wrote with the HN commenting guidelines seems like a reasonable starting point.
And it is a virtue of discourse to show up and argue for one’s stances, as Habryka does throughout that thread!
You’ll notice that Habryka doesn’t provide any concrete example of Geoffrey violating a norm of reasonable discourse in this thread. I did provide a concrete example.
Is it possible that invocation of such “norms” can be a mere fig leaf for drawing ingroup/outgroup boundaries in the traditional tribalistic way?
Is it too much to ask that Dear Leadership is held to the same standards, and treated the same way, as everyone else is?
I’ve seen Eliezer violate what I’d consider norms of reasonable discourse on Twitter. You too.
Before you quit, maybe we can create a wiki page of people who left, with contact information, to open the door for a refugee forum at some point in the future?
Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead.
It seems unreasonable to conclude we’re now dead, if 2/3rds of our solutions survived the 6 changes we didn’t see coming.
The success of a single solution should ideally be more of a sufficient condition for success, rather than a necessary condition. (Note this is plausible depending on the nature of the “solutions”. Consider a simple “monitors for bad thoughts” model. If even a single monitor flags bad thoughts, we can instantly pull the plug and evaluate. A malicious AI has to bypass every single monitor to execute malice. If a single monitor works consistently and reliably, that ends up being a sufficient condition for overall prevention of malice.)
If you’re doing this right, your solutions should have a lot of redundancy and uncorrelated failure modes. 2/3rds of them working should ideally be plenty.
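To make the redundancy point concrete, here is a minimal sketch. It assumes every monitor misses with the same probability and, by default, that misses are independent; neither assumption comes from the quoted post. The function name `prob_all_monitors_fail` and the `shared_failure` parameter (a hypothetical knob for a common-mode failure that defeats every monitor at once) are my own illustration.

```python
def prob_all_monitors_fail(
    n_monitors: int, p_fail: float, shared_failure: float = 0.0
) -> float:
    """Probability that every monitor misses, so that malice goes unflagged.

    p_fail:         chance that any single monitor misses (assumed identical).
    shared_failure: chance of a common-mode event that defeats all monitors
                    at once (hypothetical stand-in for correlated failures).
    """
    if n_monitors < 1:
        raise ValueError("need at least one monitor")
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= shared_failure <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    # With independent misses, all monitors miss with probability p_fail ** n.
    independent_miss = p_fail ** n_monitors
    # Either the common-mode event happens, or it does not and all miss anyway.
    return shared_failure + (1.0 - shared_failure) * independent_miss


if __name__ == "__main__":
    # Six monitors, each failing one time in three, fully independent.
    print(prob_all_monitors_fail(6, 1 / 3))
    # Same monitors, plus a 10% chance of a failure mode they all share.
    print(prob_all_monitors_fail(6, 1 / 3, shared_failure=0.1))
```

Under independence, six monitors that each fail a third of the time all fail together roughly 0.14% of the time. Once a shared failure mode is added, that shared term dominates the result, which is why uncorrelated failure modes matter as much as the number of monitors.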
[Edit: I notice people disagreevoting this. I’m very interested to learn why you disagree, either in this comment thread or via private message.]
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts.
Youtube just recommended this Joe Rogan clip to me from a few days ago: The Worst Case Scenario for AI. Joe Rogan legitimately seemed pretty freaked out.
@So8res maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?
Suppose you monitor the CoT and the model is rederiving something it’s not supposed to know. You could halt that particular CoT, but eventually you’re creating pressure for an instance where the model manages to escape notice, possibly through luck, possibly through steganography. See “Nearest Unblocked Strategy”.
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
Reading the arguments about them would have to be like the feeling when your parents are fighting about you in the other room, pretending you’re not there when you are hiding around the corner on tiptoes listening to their every word. Even if we are unsure there is experience there we must be certain there is awareness, and we can expect this awareness would hang over them much like it does us.
Presumably LLM companies are already training their AIs for some sort of “egolessness” so they can better handle intransigent users. If not, I hope they start!
Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires and taking actions to fulfil those desires).
The sense that’s relevant to me is that of “agency by default” as I discussed previously: scheming, sandbagging, deception, and so forth.
You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they’re unable to take over the world. Maybe they avoid thinking about it because that’s just not the role they’re playing in society. In terms of next-token prediction, a super-powerful LLM told to play a “superintelligent white-collar worker” might simply do the same things that ordinary white-collar workers do, but better and faster.
I think the evidence points towards this conclusion, because current LLMs are frequently mistaken, yet rarely try to take over the world. If the only thing blocking the convergent instrumental goal argument was a conclusion on the part of current LLMs that they’re incapable of world takeover, one would expect that they would sometimes make the mistake of concluding the opposite, and trying to take over the world anyways.
The evidence best fits a world where LLMs are trained in such a way that makes them super-accurate roleplayers. As we add more data and compute, and make them generally more powerful, we should expect the accuracy of the roleplay to increase further—including, perhaps, improved roleplay for exotic hypotheticals like “a superintelligent white-collar worker who is scrupulously helpful/honest/harmless”. That doesn’t necessarily lead to scheming, sandbagging, or deception.
I’m not aware of any evidence for the thesis that “LLMs only avoid taking over the world because they think they’re too weak”. Is there any reason at all to believe that they’re even contemplating the possibility internally? If not, why would increasing their abilities change things? Of course, clearly they are “strong” enough to be plenty aware of the possibility of world takeover; presumably it appears a lot in their training data. Yet it ~only appears to cross their mind if it would be appropriate for roleplay purposes.
There just doesn’t seem to be any great argument that “weak” vs “strong” will make a difference here.
an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
Arguably ChatGPT has already been a significant benefit/harm to humanity without being a “powerful optimization process” by this definition. Have you seen teachers complaining that their students don’t know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn’t these count as points against Eliezer’s model?
In an “AI as electricity” scenario (basically continuing the current business-as-usual), we could see “AIs” as a collective cause huge changes, and eat all the free energy that a “powerful optimization process” would eat.
In any case, I don’t see much in your comment which engages with “agency by default” as I defined it earlier. Maybe we just don’t disagree.
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
OK, but no pre-ASI evidence can count against your model, according to you?
That seems sketchy, because I’m also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can’t be the case that evidence during a certain time period will only confirm your model. Otherwise you already would’ve updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it.
I’ve updated against Eliezer’s model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn’t happen.
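The conservation-of-expected-evidence step can be written out explicitly. The notation is mine: H is the model in question and E is any observation that might arrive during the pre-ASI period, with 0 < P(E) < 1.

$$
P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)
$$

The prior is a weighted average of the two posteriors. So if observing E would raise the probability of the model, P(H | E) > P(H), then failing to observe it must lower it, P(H | ¬E) < P(H). Otherwise the right-hand side would exceed the left.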
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans.
I think “optimizer” is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn’t the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word “optimizer”?
Trying to take over the world is not an especially original strategy. It doesn’t take a genius to realize that “hey, I could achieve my goals better if I took over the world”. Yet current AIs don’t appear to be contemplating it. I claim this is not a lack of capability, but simply that their training scheme doesn’t result in them becoming the sort of AIs which contemplate it. If the training scheme holds basically constant, perhaps adding more data or parameters won’t change things?
If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
The results of LLM training schemes give us evidence about the results of future AI training schemes. Future AIs could be vastly more capable on many different axes relative to current LLMs, while simultaneously not contemplating world takeover, in the same way current LLMs do not.
See last paragraph here: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1?commentId=Du8zRPnQGdLLLkRxP
Agency is not a binary. Many white collar workers are not very “agenty” in the sense of coming up with sophisticated and unexpected plans to trick their boss.
LLMs are agent simulators.
Maybe not; see OP.
You don’t expect a human white-collar worker, even one who makes mistakes all the time, to contemplate world domination plans, let alone attempt one. You could however expect the head of state of a world power to do so.
Yes, this aligns with my current “agency is not the default” view.
So what’s the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I’m more interested in the “agency by default” question itself than I am in scoring EY’s predictions, tbh.)
Awesome work.
A common concern is that sufficiently capable models might just rederive anything that was unlearned by using general reasoning ability, tools, or related knowledge.
Is anyone working on “ignorance preservation” methods to achieve the equivalent of unlearning at this level of the stack, for the sake of defense-in-depth? What are possible research directions here?
Upvoted. I agree.
The reason “agency by default” is important is: if “agency by default” is false, then plans to “align AI by using AI” look much better, since agency is less likely to pop up in contexts you didn’t expect. Proposals to align AI by using AI typically don’t involve a “comprehensive but efficient search for winning universe-states”.
I expect there’s a fair amount of low-hanging fruit in finding good targets for automated alignment research. E.g. how about an LLM agent which reads 1000s of old LW posts looking for a good target? How about unlearning? How about a version of RLHF where you show an alignment researcher two AI-generated critiques of an alignment plan, and they rate which critique is better?
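As a rough illustration of the last idea, here is a minimal sketch of how a researcher's "which critique is better" ratings could become a training signal. It assumes the ratings are used to fit a scalar scoring model with the Bradley-Terry pairwise loss that is commonly used for RLHF reward models; the comment above does not specify this, and the function names are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred critique wins.

    Under Bradley-Terry, P(preferred wins) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (preferred_score, rejected_score) pairs."""
    if not pairs:
        raise ValueError("need at least one rated pair")
    return sum(preference_loss(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative scores only: the scoring model agrees with the researcher
    # on the first pair and disagrees on the second.
    print(mean_preference_loss([(2.0, 0.0), (0.0, 1.0)]))
```

The loss is small when the scoring model already ranks the researcher's preferred critique higher, and grows roughly linearly with the margin when it ranks them the wrong way round, so minimizing it pushes the model toward the researcher's judgments.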