Lots of ‘welcome to my world’ vibes reading your self-reports here, especially the ‘50 different people have 75 different objections for a mix of good, bad and deeply stupid reasons, and require 100 different responses, some of which are very long, and it takes a back-and-forth to figure out which one, and you can’t possibly just list everything’ and so on, and that’s without getting into actually interesting branches and the places where you might be wrong or learn something, etc.
So to take your example, which seems like a good one:
Humans don’t generalize their values out of distribution. I affirm this not as strictly fully true, but on the level of ‘this is far closer to true and generative of a superior world model than its negation’ and ‘if you meditate on this sentence you may become [more] enlightened.’
I too have noticed that people seem to think that they do so generalize in ways they very much don’t, and this leads to a lot of rather false conclusions.
I also notice that I’m not convinced we are thinking about the sentence all that similarly, in ways that could end up being pretty load-bearing. Stuff gets complicated.
I think that when you say the statement is ‘trivially’ true you are wrong about that, or at least holding people to unrealistic standards of epistemics? And that a version of this mistake is part of the problem. At least from me (I presume from others too) you get a very different reaction from saying each of:
Humans don’t generalize their values out of distribution. (let this be [X]).
Statement treating [X] as in-context common knowledge.
It is trivially true that [X] (said explicitly), or ‘obviously’ [X], or similar.
I believe that [X] or am very confident that [X]. (without explaining why you believe this)
I believe that [X] or am very confident that [X], but it is difficult for me to explain/justify.
And so on. I am very deliberate, or try to be, on which one I say in any given spot, even at the cost of a bunch of additional words.
Another note is that I think in spots like this you basically do have to say this even if the subject already knows it, to establish common knowledge and that you are basing your argument on this, even if only to orient them that this is where you are reasoning from. So it was a helpful statement to say and a good use of a sentence.
I see that you get disagreement votes when you say this on LW, but the comments don’t end up with negative karma or anything. I can see how that can be read as ‘punishment’ but I think that’s the system working as intended, and I don’t know what a better one would be?
In general, I think if you have a bunch of load-bearing statements where you are very confident they are true but people typically think the statement is false and you can’t make an explicit case for them (either because you don’t have that kind of time/space, or because you don’t know how), then the most helpful thing to do is to tell the other person the thing is load bearing, and gesture towards it and why you believe it, but be clear you can’t justify it. You can also look for arguments that reach the same conclusion without it—often true things are highly overdetermined so you can get a bunch of your evidence ‘thrown out of court’ and still be fine, even if that sucks.
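To make the ‘overdetermined’ point concrete, here is a toy odds calculation; the specific numbers are assumptions for illustration only, not drawn from any real case:

```python
# Toy example (illustrative numbers only): three independent lines of evidence,
# each with a 4:1 likelihood ratio in favor of the claim, from even (1:1) prior odds.

def posterior_odds(prior_odds: float, likelihood_ratios: list[float]) -> float:
    """Multiply prior odds by each independent likelihood ratio."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

all_evidence = [4.0, 4.0, 4.0]
print(posterior_odds(1.0, all_evidence))       # 64.0 -> 64:1 with everything admitted
print(posterior_odds(1.0, all_evidence[:-1]))  # 16.0 -> still 16:1 after one line is thrown out of court
```

Dropping a third of the evidence still leaves the conclusion strongly supported, which is the sense in which losing a line of argument you could not justify need not sink the overall case.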
Do you think maybe rationalists are spending too much effort attempting to saturate the dialogue tree (probably not effective at winning people over) versus improving the presentation of the core argument for an AI moratorium?
Smart people don’t want to see the 1000th response on whether AI actually could kill everyone. At this point we’re convinced. Admittedly, not literally all of us, but those of us who are not yet convinced are not going to become suddenly enlightened by Yudkowsky’s x.com response to some particularly moronic variation of an objection he already responded to 20 years ago. (Why does he do this? Does he think it has any kind of positive impact?)
A much better use of time would be to work on an article which presents the solid version of the argument for an AI moratorium. I.e., not an introductory text or article in Time Magazine, and not an article targeted at people he clearly thinks are just extremely stupid relative to him, where he rants for 10,000 words trying to drive home a relatively simple point. But rather an argument in a format that doesn’t necessitate a weak or incomplete presentation.
I and many other smart people want to see the solid version of the argument, without the gaping holes which are excusable in popular work and rants but inexcusable in rational discourse. This page does not exist! You want a moratorium, tell us exactly why we should agree! Having a solid argument is what ultimately matters in intellectual progress. Everything else is window dressing. If you have a solid argument, great! Please show it to me.
My guess is that on the margin more time should be spent improving the core messaging versus saturating the dialogue tree, on many AI questions, if you combine effort across everyone.
We cannot offer anything to the ASI, so it will have no reasons to keep us around aside from ethical ones.
Nor can we ensure that an ASI that decided to commit genocide would fail to carry it out.
We don’t know a way to create an ASI and infuse ethics into it. SOTA alignment methods have major problems, which are best illustrated by sycophancy and by LLMs supporting clearly delirious users.[1] OpenAI’s Model Spec explicitly prohibited[2] sycophancy, and one of Claude’s Commandments is “Choose the response that is least intended to build a relationship with the user.” And yet this didn’t prevent LLMs from becoming sycophantic. Apparently, the only known non-sycophantic model is KimiK2.
KimiK2 is a Chinese model created by a new team, and that team is the only one that guessed that one should rely on RLVR and self-critique instead of bias-inducing RLHF. We can’t exclude the possibility that Kimi’s success is due more to luck than to actual thinking about sycophancy and RLHF.
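As a rough illustration of why RLVR is less prone to rewarding flattery than RLHF, here is a minimal sketch of the two reward signals; the function names and details are assumptions for illustration, not anyone’s actual training recipe:

```python
# Minimal sketch of the two reward signals, not an actual training pipeline.

def rlvr_reward(answer: str, ground_truth: str) -> float:
    """RLVR-style reward: scored against a verifiable criterion, so agreeing
    with the user earns nothing unless the answer is actually correct."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def rlhf_reward(reward_model_score: float) -> float:
    """RLHF-style reward: a learned proxy for human approval. Because raters
    tend to prefer answers that agree with them, optimizing this proxy can
    drift toward sycophancy."""
    return reward_model_score
```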
[1] Strictly speaking, Claude Sonnet 4, which was red-teamed in Tim Hua’s experiment, is second best at pushing back, after KimiK2. Tim remarks that Claude sucks at the Spiral Bench because the personas in Tim’s experiment, unlike those in the Spiral Bench, are supposed to be under stress.
[2] Strictly speaking, it is done as a User-level instruction, which arguably means that it can be overridden at the user’s request. But GPT-4o was overly sycophantic without users instructing it to be so.