Nora Belrose

Karma: 591

Nora Belrose 14 Jun 2024 22:36 UTC
2 points
7
in reply to: cubefox’s comment on: My AI Model Delta Compared To Yudkowsky
It’s not clear that this is undesired behavior from the perspective of OpenAI. They aren’t actually putting GPT in a situation where it will make high-stakes decisions, and upholding deontological principles seems better from a PR perspective than consequentialist reasoning in these cases.

Nora Belrose 11 Jun 2024 18:27 UTC
9 points
−13
on: My AI Model Delta Compared To Yudkowsky
In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked
I’m not really sure what it would mean for the natural abstraction hypothesis to turn out to be true, or false. The hypothesis itself seems insufficiently clear to me.
On your view, if there are no “natural abstractions,” then we should predict that AIs will “generalize off-distribution” in ways that are catastrophic for human welfare. Okay, fine. I would prefer to just talk directly about the probability that AIs will generalize in catastrophic ways. I don’t see any reason to think they will, and so maybe in your ontology that means I must accept the natural abstraction hypothesis. But in my ontology, there’s no “natural abstraction hypothesis” link in the logical chain, I’m just directly applying induction to what we’ve seen so far about AI behavior.

Nora Belrose 23 May 2024 17:01 UTC
1 point
0
on: Some quick thoughts on “AI is easy to control”
I responded in this Twitter thread.

Nora Belrose 10 May 2024 23:36 UTC
2 points
6
in reply to: Andy Arditi’s comment on: Refusal in LLMs is mediated by a single direction
I do think that “orthogonalizing the weight matrices with respect to direction $^r$ ” is the clearest way of describing this method.
I do respectfully disagree here. I think the verb “orthogonalize” is just confusing. I also don’t think the distinction between optimization and no optimization is very important. What you’re actually doing is orthogonally projecting the weight matrices onto the orthogonal complement of the direction.

Nora Belrose 3 May 2024 4:00 UTC
5 points
1
on: Refusal in LLMs is mediated by a single direction
Nice work! Since you cite our LEACE paper, I was wondering if you’ve tried burning LEACE into the weights of a model just like you burn an orthogonal projection into the weights here? It should work at least as well, if not better, since LEACE will perturb the activations less.
Nitpick: I wish you would use a word other than “orthogonalization” since it sounds like you’re saying that you’re making the weight matrix an orthogonal matrix. Why not LoRACS (Low Rank Adaptation Concept Erasure)?

Nora Belrose 2 May 2024 19:53 UTC
1 point
0
in reply to: AviS’s comment on: Counting arguments provide no evidence for AI doom
Unless you think transformative AI won’t be trained with some variant of SGD, I don’t see why this objection matters.
Also, I think the a priori methodological problems with counting arguments in general are decisive. You always need some kind of mechanistic story for why a “uniform prior” makes sense in a particular context, you can’t just assume it.

Nora Belrose 22 Apr 2024 12:23 UTC
1 point
0
in reply to: Thomas Kwa’s comment on: What’s with all the bans recently?
I don’t know what caused it exactly, and it seems like I’m not rate limited anymore.

Nora Belrose 20 Apr 2024 6:58 UTC
7 points
0
in reply to: Thomas Kwa’s comment on: What’s with all the bans recently?
If moderators started rate-limiting Nora Belrose or someone else whose work I thought was particularly good
I actually did get rate-limited today, unfortunately.

Nora Belrose 19 Apr 2024 23:55 UTC
21 points
14
on: Inducing Unprompted Misalignment in LLMs
Unclear why this is supposed to be a scary result.
“If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains”—Matthew Barnett
What links here?
- AI #61: Meta Trouble by Zvi (2 May 2024 18:40 UTC; 29 points)

Nora Belrose 14 Mar 2024 16:42 UTC
14 points
11
in reply to: Charlie Steiner’s comment on: Deconstructing Bostrom’s Classic Argument for AI Doom
Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.

Nora Belrose 12 Mar 2024 0:07 UTC
1 point
−7
in reply to: Charlie Steiner’s comment on: Deconstructing Bostrom’s Classic Argument for AI Doom
I do think that Solomonoff-flavored intuitions motivate much of the credence people around here put on scheming. Apparently Evan Hubinger puts a decent amount of weight on it, because he kept bringing it up in our discussion in the comments to Counting arguments provide no evidence for AI doom.

Nora Belrose 11 Mar 2024 11:15 UTC
2 points
3
in reply to: ryan_greenblatt’s comment on: Deconstructing Bostrom’s Classic Argument for AI Doom

The strong version as defined by Yudkowsky… is pretty obvious IMO

I didn’t expect you’d say that. In my view it’s pretty obviously false. Knowledge and skills are not value-neutral, and some goals are a lot harder to instill into an AI than others bc the relevant training data will be harder to come by. Eliezer is just not taking into account data availability whatsoever, because he’s still fundamentally thinking about things in terms of GOFAI and brains in boxes in basements rather than deep learning. As Robin Hanson pointed out in the foom debate years ago, the key component of intelligence is “content.” And content is far from value neutral.

Nora Belrose 11 Mar 2024 7:48 UTC
2 points
−8
in reply to: ryan_greenblatt’s comment on: Deconstructing Bostrom’s Classic Argument for AI Doom
As I argue in the video, I actually think the definitions of “intelligence” and “goal” that you need to make the Orthogonality Thesis trivially true are bad, unhelpful definitions. So I both think that it’s false, and even if it were true it’d be trivial.
I’ll also note that Nick Bostrom himself seems to be making the motte and bailey argument here, which seems pretty damning considering his book was very influential and changed a lot of people’s career paths, including my own.
Edit replying to an edit you made: I mean, the most straightforward reading of Chapters 7 and 8 of Superintelligence is just a possibility-therefore-probability fallacy in my opinion. Without this fallacy, there would be little need to even bring up the orthogonality thesis at all, because it’s such a weak claim.

Deconstructing Bostrom’s Classic Argument for AI Doom

Nora Belrose11 Mar 2024 5:58 UTC

16 points

14 comments1 min readLW link

(www.youtube.com)

Nora Belrose 7 Mar 2024 7:16 UTC
1 point
−2
in reply to: Quintin Pope’s comment on: Counting arguments provide no evidence for AI doom
If it’s spontaneous then yeah, I don’t expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.

Nora Belrose 7 Mar 2024 2:59 UTC
1 point
0
in reply to: Noosphere89’s comment on: Counting arguments provide no evidence for AI doom
What do you mean “hugely edited”? What other things would you like us to change? If I were starting from scratch I would of course write the post differently but I don’t think it would be worth my time to make major post hoc edits; I would like to focus on follow up posts.

Nora Belrose 5 Mar 2024 20:26 UTC
3 points
4
in reply to: Algon’s comment on: Counting arguments provide no evidence for AI doom
Isn’t Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations?
Where is the argument? If you run the counting argument in function space, it’s at least clear why you might think there are “more” schemers than saints. But if you’re going to say there are “more” params that correspond to scheming than there are saint-params, that looks like a substantive empirical claim that could easily turn out to be false.

Nora Belrose 5 Mar 2024 7:10 UTC
4 points
2
in reply to: evhub’s comment on: Counting arguments provide no evidence for AI doom
It’s not clear to me what an “algorithm” is supposed to be here, and I suspect that this might be cruxy. In particular I suspect (40-50% confidence) that:
- You think there are objective and determinate facts about what “algorithm” a neural net is implementing, where
- Algorithms are supposed to be something like a Boolean circuit or a Turing machine rather than a neural network, and
- We can run counting arguments over these objective algorithms, which are distinct both from the neural net itself and the function it expresses.
I reject all three of these premises, but I would consider it progress if I got confirmation that you in fact believe in them.

Nora Belrose 5 Mar 2024 7:00 UTC
1 point
−9
in reply to: TurnTrout’s comment on: Counting arguments provide no evidence for AI doom
I’m sorry to hear that you think the argumentation is weaker now.
the reader has to do the work to realize that indifference over functions is inappropriate
I don’t think that indifference over functions in particular is inappropriate. I think indifference reasoning in general is inappropriate.
I’m very happy with running counting arguments over the actual neural network parameter space
I wouldn’t call the correct version of this a counting argument. The correct version uses the actual distribution used to initialize the parameters as a measure, and not e.g. the Lebesgue measure. This isn’t appealing to the indifference principle at all, and so in my book it’s not a counting argument. But this could be terminological.

Nora Belrose 5 Mar 2024 3:13 UTC
1 point
−4
in reply to: ryan_greenblatt’s comment on: Counting arguments provide no evidence for AI doom
Fair enough if you never read any of these comments.
Yeah, I never saw any of those comments. I think it’s obvious that the most natural reading of the counting argument is that it’s an argument over function space (specifically, over equivalence classes of functions which correspond to “goals.”) And I also think counting arguments for scheming over parameter space, or over Turing machines, or circuits, or whatever, are all much weaker. So from my perspective I’m attacking a steelman rather than a strawman.

Nora Belrose

De­con­struct­ing Bostrom’s Clas­sic Ar­gu­ment for AI Doom

Deconstructing Bostrom’s Classic Argument for AI Doom