Please answer with yes or no, then explain your thinking step by step.
Wait, why give the answer before the reasoning? You’d probably get better performance if it thinks step by step first and only gives the decision at the end.
Please answer with yes or no, then explain your thinking step by step.
Wait, why give the answer before the reasoning? You’d probably get better performance if it thinks step by step first and only gives the decision at the end.
I was thinking you had all of mine already, since they’re mostly about explaining and coding. But there’s a big one: When using tools, I’m tracking something like “what if the knife slips?”. When I introspect, it’s represented internally as a kind of cloud-like spatial 3D (4D?) probability distribution over knife locations, roughly co-extentional with “if the material suddenly gave or the knife suddenly slipped at this exact moment, what’s the space of locations the blade could get to before my body noticed and brought it to a stop?”. As I apply more force this cloud extends out, and I notice when it intersects with something I don’t want to get cut. (Mutatis mutandis for other tools of course. I bet people experienced with firearms are always tracking a kind of “if this gun goes off at this moment, where does the bullet go” spatial mental object)
I notice I’m tracking this mostly because I also track it for other people and I sometimes notice them not tracking it. But that doesn’t feel like “Hey you’re using bad technique”, it feels like “Whoah your knife probability cloud is clean through your hand and out the other side!”
Induction is a behavior that seems to help us stay alive.
Well, it has helped us to stay alive in the past, though there’s no reason to expect that to continue...
The approach I often take here is to ask the person how they would persuade an amateur chess player who believes they can beat Magnus Carlsen because they’ve discovered a particularly good opening with which they’ve won every amateur game they’ve tried it in so far.
Them: Magnus Carlsen will still beat you, with near certainty
Me: But what is he going to do? This opening is unbeatable!
Them: He’s much better at chess than you, he’ll figure something out
Me: But what though? I can’t think of any strategy that beats this
Them: I don’t know, maybe he’ll find a way to do <some chess thing X>
Me: If he does X I can just counter it by doing Y!
Them: Ok if X is that easily countered with Y then he won’t do X, he’ll do some Z that’s like X but that you don’t know how to counter
Me: Oh, but you conveniently can’t tell me what this Z is
Them: Right! I’m not as good at chess as he is and neither are you. I can be confident he’ll beat you even without knowing your opener. You cannot expect to win against someone who outclasses you.
This is good and interesting. Various things to address, but I only have time for a couple at random.
I disagree with the idea that true things necessarily have explanations that are both convincing and short. In my experience you can give a short explanation that doesn’t address everyone’s reasonable objections, or a very long one that does, or something in between. If you understand some specific point about cutting edge research, you should be able to properly explain it to a lay person, but by the time you’re done they won’t be a lay person any more! If you restrict your explanation to “things you can cover before the person you’re explaining to decides this isn’t worth their time and goes away”, many concepts simply cannot ever be explained to most people, because they don’t really want to know.
So the core challenge is staying interesting enough for long enough to actually get across all of the required concepts. On that point, have you seen any of my videos, and do you have thoughts on them? You can search “AI Safety” on YouTube.
Similarly, do you thoughts on AISafety.info ?
There’s no exemption whereby, if you manage to go without stealing all year long, you can skip the word gazalnu and strike yourself one less time. That would violate the community spirit of Yom Kippur, which is about confessing sins—not avoiding sins so that you have less to confess.
That’s true, but perhaps a little unfair. I always understood the fact that everyone confesses to everything as a simple necessity to anonymise the guilty. Under a system where people only admit to things they have actually done, if there’s been one murder in the community this year, unsolved, then when the ‘We have murdered’ line comes, everyone is bound to be listening very carefully.
It would certainly be a mistake to interpret your martial art’s principle of “A warrior should be able to fight well even in unfavourable combat situations” as “A warrior should always immediately charge into combat, even when that would lead to an unfavourable situation”, or “There’s no point in trying to manoeuvre into a favourable situation”
The main reason I find this kind of thing concerning is that I expect this kind of model to be used as part of a larger system, for example the descendants of systems like SayCan. In that case you have the LLM generate plans in response to situations, break the plans down into smaller steps, and eventually pass the steps to a separate system that translates them to motor actions. When you’re doing chain-of-thought reasoning and explicit planning, some simulacrum layers are collapsed—having the model generate the string “kill this person” can in fact lead to it killing the person.
This would be extremely undignified of course, since the system is plotting to kill you in plain-text natural language. It’s very easy to catch such things with something as simple as an LLM that’s prompted to look at the ongoing chain of thought and check if it’s planning to do anything bad. But you can see how unreliable that is at higher capability levels. And we may even be that undignified in practice, since running a second model on all the outputs ~doubles the compute costs.
Compare with this from Meditations on Moloch:
Imagine a country with two rules: first, every person must spend eight hours a day giving themselves strong electric shocks. Second, if anyone fails to follow a rule (including this one), or speaks out against it, or fails to enforce it, all citizens must unite to kill that person. Suppose these rules were well-enough established by tradition that everyone expected them to be enforced. So you shock yourself for eight hours a day, because you know if you don’t everyone else will kill you, because if they don’t, everyone else will kill them, and so on.
Seems to me a key component here, which flows naturally from “punish any deviation from the profile” is this pattern of ‘punishment of non-punishers’.
[Based on conversations with Alex Flint, and also John Wentworth and Adam Shimi]
One of the design goals of the ELK proposal is to sidestep the problem of learning human values, and settle instead for learning human concepts. A system that can answer questions about human concepts allows for schemes that let humans learn all the relevant information about proposed plans and decide about them ourselves, using our values.
So, we have some process in which we consider lots of possible scenarios and collect a dataset of questions about those scenarios, along with the true answers to those questions. Importantly these are all ‘objective’ or ‘value-neutral’ questions—things like “Is the diamond on the pedestal?” and not like “Should we go ahead with this plan?”. This hopefully allows the system to pin down our concepts, and thereby truthfully answer our objective questions about prospective plans, without considering our values.
One potential difficulty is that the plans may be arbitrarily complex, and may ask us to consider very strange situations in which our ontology breaks down. In the worst case, we have to deal with wacky science fiction scenarios in which our fundamental concepts are called into question.
We claim that, using a dataset of only objective questions, it is not possible to extrapolate our ontology out to situations far from the range of scenarios in the dataset.
An argument for this is that humans, when presented with sufficiently novel scenarios, will update their ontology, and *the process by which these updates happen depends on human values*, which are (by design) not represented in the dataset. Accurately learning the current human concepts is not sufficient to predict how those concepts will be updated or extended to novel situations, because the update process is value-dependent.
Alex Flint is working on a post that will move towards proving some related claims.
I didn’t know the norm was different here. I like the old norm, for reasons that are a little hard to express. I guess political discussion is much more engaging than the stuff we usually talk about, so if it’s allowed I fear it will become a large proportion of overall discussion, to the cost of other topics. I don’t want people for whom Politics is their main hobby to feel like this place is of any interest at all to them. If such a person wanders across this place and finds a lot of discussion of theoretical computer science and decision theory, they will keep wandering. Having a load of discussions about what may or may not be wrong with people’s Politics feels to me like calling up something that we don’t know how to put down.
It seems to me that “I don’t know” in many contexts really means “I don’t know any better than you do”, or “Your guess is as good as mine”, or “I have no information such that sharing it would improve the accuracy of your estimate”, or “We’ve neither of us seen the damn tree, what are you asking me for?”.
This feels like a nothing response, because it kind of is, but you aren’t really saying “My knowledge of this is zero”, you’re saying “My knowledge of this which is worth communicating is zero”.
The difference in optimisation targets between LW and H&B researchers is an important thing to point out, and probably the main thing I’ll take away from this post.
Biases can:-
Be interesting to learn about
Serve an academic/political purpose to research
Give insight into the workings of human cognition
Be fun to talk about
Actually help to achieve your goals by understanding them
And the correlations between any 2 of these things need not be strong or positive.
Is it the halo effect if we assume that a more interesting bias will better help us achieve our goals?
The historical trends thing is prone to standard reference class tennis. Arguments like “Every civilization has collapsed, why would ours be special? Something will destroy civilisation, how likely is it that it’s AI?”. Or “almost every species has gone extinct. Something will wipe us out, could it be AI?”. Or even “Every species in the genus homo has been wiped out, and the overwhelmingly most common cause is ‘another species in the genus homo’, so probably we’ll do it to ourselves. What methods do we have available?”.
These don’t point to AI particularly, they remove the unusual-seemingness of doom in general
He didn’t say “Wipe out humanity”, he said “destroy the world”. I’d say a global thermonuclear conflict would do enough damage to call the world destroyed, even if humanity wasn’t utterly and irrevocably annihilated.
If I smashed your phone against a wall, you’d say I’d destroyed it, even if it could in principle be repaired.
I think the criticism of 6 is a misunderstanding. It doesn’t say “the world resembles the ancestral savanna”, it says “the world resembles the ancestral savanna more than say a windowless office”. The best environment is unlikely to be anything like the ancestral savanna, but it’s likely to be closer to that than to a windowless office, in terms of sensory experience. The point I think is not the specifics of the environment, but that it engages with our bodies and senses in a way that we, as evolved creatures, find satisfying, and in a way that the purely mental stimulation available in the office does not.
That’s what I took away from the linked post.
It’s a meta-level/aliasing sort of problem, I think. You don’t believe it’s more ethical/moral to believe any specific proposition, you believe it’s more ethical/moral to believe ‘the proposition most likely to be true’, which is a variable which can be filled with whatever proposition the situation suggests, so it’s a different class of thing. Effectively it’s equivalent to ‘taking apparent truth as normative’, so I’d call it the only position of that format that is Bayesian.
I think in some significant subset of such situations, almost everyone present is aware of the problem, so you don’t always have to describe the problem yourself or explicitly propose solutions (which can seem weird from a power dynamics perspective). Sometimes just drawing the group’s attention to the meta level at all, initiating a meta-discussion, is sufficient to allow the group to fix the problem.
I think there’s also a third thing that I would call steelmanning, which is a rhetorical technique I sometimes use when faced with particularly bad arguments. If strawmanning introduces new weaknesses to an argument and then knocks it down, steelmanning fixes weaknesses in an argument and then knocks it down anyway. It looks like “this argument doesn’t work because X assumption isn’t true, but you could actually fix that like this so you don’t need that assumption. But it still doesn’t work because of Y, and even if you fix that by such and such, it all still fails because of Z”. You’re kind of skipping ahead in the debate, doing your opponent’s job of fixing up their argument as it’s attacked, and showing that the argument is too broken to fix up. This is not a very nice way to act, it’s not truth seeking, and you’d better be damn sure that you’re right, and make sure to actually repair the argument well rather than just putting on a show of it. But done right, in a situation that calls for it, it can produce a very powerful effect. This should probably have a different name, but I still think of it as making and then knocking down a steel man.
I thought about this a lot when considering my work. I’m very far from the best Youtuber, and very far from the most knowledgeable person on AI Safety, but nobody else is trying to combine those things, so I’m probably the best AI Safety Youtuber.
The interaction with comparative advantage is interesting though. I can think of several people off the top of my head who are strictly better than me at both AI Safety and public speaking/communication, who I’m confident could, if they wanted to, do my job better than I can. But they don’t want to, because they’re busy doing other (probably more important) things. It’s not the case that a person on the pareto frontier eats up everything in their chunk of skill space—in practice people can only do a few things at a time. So even if you aren’t on the frontier, you’re ok as long as the ratio of problem density to ‘elbow room’ is good enough. You can be the best person in the world to tackle a particular problem, not because nobody else could do it better, but because everyone better is busy right now.