I think almost all of these are things that I’d only think after I’d already noticed confusion, and most are things I’d never say in my head anyway. A little way into the list I thought “Wait, did he just ask ChatGPT for different ways to say ‘I’m confused’?”
I expect there are things that pop up in my inner monologue when I’m confused about something that I wouldn’t consciously notice, and it would be very useful to have a list of such phrases, but your list contains ~none of them.
Edit: Actually, the last three are reasonable. Are they human-written?
One way of framing the difficulty with the lanternflies thing is that the question straddles the is-ought gap. It decomposes pretty cleanly into two questions: “What states of the universe are likely to result from me killing vs not killing lanternflies?” (about which Bayes’ Rule fully applies and is enormously useful), and “Which states of the universe do I prefer?”, where the only evidence you have will come from things like introspection about your own moral intuitions and values. Your values are also a fact about the universe, because you are part of the universe, so Bayes still applies I guess, but it’s quite a different question to think about.

If you have well-defined values, for example some function from states (or histories) of the universe to real numbers, such that larger numbers represent universe states you would always prefer over smaller numbers, then every “should I do X or Y?” question has an answer in terms of those values. In practice we’ll never have that, but it’s still worth thinking separately about “What are the expected consequences of the proposed policy?” and “What consequences do I want?”, which a ‘should’ question implicitly mixes together.
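The decomposition can be sketched in a few lines of code. This is a toy illustration, not a claim about actual lanternfly ecology: the probabilities and utilities are made-up numbers standing in for the two separate questions (beliefs about outcomes, and preferences over outcomes).

```python
# Toy sketch of separating the "is" question (what will happen, where
# Bayes applies) from the "ought" question (what do I want).
# All numbers are invented for illustration.

# "Is": beliefs about outcome states, conditional on each action.
p_outcomes = {
    "kill":      {"fewer_lanternflies": 0.9, "no_change": 0.1},
    "dont_kill": {"fewer_lanternflies": 0.1, "no_change": 0.9},
}

# "Ought": a function from universe states to real numbers.
utility = {"fewer_lanternflies": 1.0, "no_change": 0.0}

def expected_utility(action):
    return sum(p * utility[state] for state, p in p_outcomes[action].items())

# The "should" question only has an answer once both parts are pinned down.
best_action = max(p_outcomes, key=expected_utility)
```

Changing either the probabilities or the utilities can flip `best_action`, which is exactly why arguing about a ‘should’ question without separating the two parts goes in circles.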
I’ve always thought of it like, it doesn’t rely on the universe being computable, just on the universe having a computable approximation.
So if the universe is computable, SI does perfectly, if it’s not, SI does as well as any algorithm could hope to.
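The “does as well as any algorithm could hope to” claim is the standard dominance argument for Bayes mixtures. Here is a toy, computable version of it (not actual Solomonoff induction, which is uncomputable): a mixture over a finite hypothesis class, whose cumulative log loss provably stays within log(1/prior) of the best hypothesis in the class. SI makes the same argument with the class of all computable predictors.

```python
import math
import random

random.seed(0)

# Finite hypothesis class: candidate values of P(bit = 1), uniform prior.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = 1.0 / len(hypotheses)

# Data drawn from one of the hypotheses (the mixture doesn't know which).
true_p = 0.7
data = [1 if random.random() < true_p else 0 for _ in range(200)]

def log_loss(p, bits):
    """Cumulative log loss (nats) of always predicting P(1) = p."""
    return -sum(math.log(p if b else 1.0 - p) for b in bits)

# Mixture's loss: -log( sum_h prior * P(data | h) ).
mix_loss = -math.log(sum(prior * math.exp(-log_loss(h, data)) for h in hypotheses))
best_loss = min(log_loss(h, data) for h in hypotheses)

# Dominance: the mixture is never more than log(1/prior) nats behind
# the best hypothesis, no matter how the data was generated.
bound = math.log(len(hypotheses))
```

The bound is a constant that doesn’t grow with the amount of data, which is the sense in which the mixture (and, by the same argument, SI over computable environments) “does as well as any algorithm could hope to”.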
A slightly surreal experience to read a post saying something I was just tweeting about, written by a username that could plausibly be mine.
Do we even need a whole new term for this? Why not “Sudden Deceptive Alignment”?
I think in some significant subset of such situations, almost everyone present is aware of the problem, so you don’t always have to describe the problem yourself or explicitly propose solutions (which can seem weird from a power dynamics perspective). Sometimes just drawing the group’s attention to the meta level at all, initiating a meta-discussion, is sufficient to allow the group to fix the problem.
This is good and interesting. Various things to address, but I only have time for a couple at random.
I disagree with the idea that true things necessarily have explanations that are both convincing and short. In my experience you can give a short explanation that doesn’t address everyone’s reasonable objections, or a very long one that does, or something in between. If you understand some specific point about cutting-edge research, you should be able to properly explain it to a layperson, but by the time you’re done they won’t be a layperson any more! If you restrict your explanation to “things you can cover before the person you’re explaining to decides this isn’t worth their time and goes away”, many concepts simply cannot ever be explained to most people, because they don’t really want to know.
So the core challenge is staying interesting enough for long enough to actually get across all of the required concepts. On that point, have you seen any of my videos, and do you have thoughts on them? You can search “AI Safety” on YouTube.
Similarly, do you have thoughts on AISafety.info?
Are we not already doing this? I thought we were already doing this. See, for example, this talk I gave in 2018.
I guess we can’t be doing it very well, though.
Structured time boxes seem very suboptimal; steamrollering is easy enough for a moderator to deal with: “Ok, let’s pause there for X to respond to that point.”
This would make a great YouTube series
Edit: I think I’m going to make this a YouTube series
Other tokens that require modelling more than a human:
The results sections of scientific papers—this requires modelling whatever the experiment was about. If humans could do this, they wouldn’t have needed to run the experiment.
Records of stock price movements—in principle, getting zero loss on this would require insanely high levels of capability.
Compare with this from Meditations on Moloch:
Imagine a country with two rules: first, every person must spend eight hours a day giving themselves strong electric shocks. Second, if anyone fails to follow a rule (including this one), or speaks out against it, or fails to enforce it, all citizens must unite to kill that person. Suppose these rules were well-enough established by tradition that everyone expected them to be enforced. So you shock yourself for eight hours a day, because you know if you don’t everyone else will kill you, because if they don’t, everyone else will kill them, and so on.
Seems to me a key component here, which flows naturally from “punish any deviation from the profile” is this pattern of ‘punishment of non-punishers’.
The historical trends thing is prone to standard reference class tennis. Arguments like “Every civilisation has collapsed, so why would ours be special? Something will destroy civilisation; how likely is it that it’s AI?”. Or “Almost every species has gone extinct. Something will wipe us out; could it be AI?”. Or even “Every species in the genus Homo has been wiped out, and the overwhelmingly most common cause is ‘another species in the genus Homo’, so probably we’ll do it to ourselves. What methods do we have available?”.
These don’t point to AI in particular; they remove the unusual-seemingness of doom in general.
Oh, I missed that! Thanks. I’ll delete I guess.
I think there’s also a third thing that I would call steelmanning, which is a rhetorical technique I sometimes use when faced with particularly bad arguments. If strawmanning introduces new weaknesses to an argument and then knocks it down, steelmanning fixes weaknesses in an argument and then knocks it down anyway. It looks like “this argument doesn’t work because X assumption isn’t true, but you could actually fix that like this so you don’t need that assumption. But it still doesn’t work because of Y, and even if you fix that by such and such, it all still fails because of Z”.
You’re kind of skipping ahead in the debate, doing your opponent’s job of fixing up their argument as it’s attacked, and showing that the argument is too broken to fix up.
This is not a very nice way to act, and it’s not truth-seeking; you’d better be damn sure that you’re right, and make sure to actually repair the argument well rather than just putting on a show of it. But done right, in a situation that calls for it, it can produce a very powerful effect. This should probably have a different name, but I still think of it as making and then knocking down a steel man.
The main reason I find this kind of thing concerning is that I expect this kind of model to be used as part of a larger system, for example the descendants of systems like SayCan. In that case you have the LLM generate plans in response to situations, break the plans down into smaller steps, and eventually pass the steps to a separate system that translates them to motor actions. When you’re doing chain-of-thought reasoning and explicit planning, some simulacrum layers are collapsed—having the model generate the string “kill this person” can in fact lead to it killing the person.
This would be extremely undignified of course, since the system is plotting to kill you in plain-text natural language. It’s very easy to catch such things with something as simple as an LLM that’s prompted to look at the ongoing chain of thought and check if it’s planning to do anything bad. But you can see how unreliable that is at higher capability levels. And we may even be that undignified in practice, since running a second model on all the outputs ~doubles the compute costs.
Makes sense. I guess the thing to do is bring it to some bio-risk people in a less public way.