There’s a logic error I keep seeing here.
I agree, there’s more reason to think LLMs might not just magically align.
That fact does not distinguish between the worlds I’m pointing out though.
Here are two worlds:
We’re hurtling toward AI doom.
We’re subconsciously terrified and are projecting our terror onto the world, causing us to experience our situation as hurtling toward AI doom, and creating a lot of agreement with one another since we’re converging on that strategy for projecting our subconscious terror.
There are precious few, if any, arguments of the form “But here’s a logical reason why we’re doomed!” that can distinguish between these two worlds. It doesn’t matter how sound the argument is. Its type signature isn’t cruxy.
I’m not claiming there’s no way to distinguish between these worlds, to be clear. I’m pointing out that the approach of “But this doomy reasoning is really, really compelling!” is simply not relevant as a distinguisher here.
The whole point of the OP is to suggest that we get clear on the distinction so that we can really tell in 2.5 years, and possibly pivot accordingly. But not to argue that we pivot that way now. (That might make sense, and it might turn out to be something we’ll later wish we’d figured out to do sooner. But I’m not trying to argue that here.)
You can literally try to find out how bad people feel. Do x-riskers feel bad? Do people feel worse once they get convinced AI x-risk is a thing, or better once they get convinced it isn’t?
A standard instrument for this is the PANAS (Positive and Negative Affect Schedule), a simple mood inventory. If you don’t trust self-report, you can use heart rate or heart rate variability, which most smartwatches will measure.
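To make that concrete, here’s a minimal sketch of the comparison, assuming you already had PANAS-style responses in hand. Every respondent, rating, and group label below is invented for illustration; the only real structure is that the PANAS sums ten positive-affect and ten negative-affect items, each rated 1–5, into two subscale scores.

```python
# Minimal sketch (not a validated instrument implementation): score hypothetical
# PANAS responses and compare mean affect between two self-identified groups.

from statistics import mean

def panas_scores(responses):
    """responses: dict with 'positive' and 'negative' lists of ten 1-5 ratings."""
    return sum(responses["positive"]), sum(responses["negative"])

# Hypothetical respondents, labeled by whether they report believing in AI x-risk.
survey = [
    {"believes_xrisk": True,  "positive": [3] * 10, "negative": [2] * 10},
    {"believes_xrisk": True,  "positive": [2] * 10, "negative": [3] * 10},
    {"believes_xrisk": False, "positive": [4] * 10, "negative": [1] * 10},
    {"believes_xrisk": False, "positive": [3] * 10, "negative": [2] * 10},
]

for group in (True, False):
    rows = [panas_scores(r) for r in survey if r["believes_xrisk"] == group]
    pa = mean(s[0] for s in rows)  # mean positive-affect subscale (range 10-50)
    na = mean(s[1] for s in rows)  # mean negative-affect subscale (range 10-50)
    print(f"believes_xrisk={group}: mean positive affect {pa:.1f}, mean negative affect {na:.1f}")
```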
Now, to some degree, there’s a confound: people like you (very interested in exercise and psychology) might feel better than someone who doesn’t have that focus, and, all else equal, maybe people who focus on x-risk are less likely to focus on exercise/meditation/circling/etc.
So the other thing is: if people who believe in x-risk take up a mental/physical health practice, you can ask whether it makes them feel better, and also whether it makes them more likely to stop believing in x-risk.
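Here’s a minimal sketch of that pre/post comparison. All of the participants, scores, and belief labels are made up; it just shows the two quantities you’d look at.

```python
# Each record is a hypothetical x-risk believer who took up some practice:
# PANAS negative-affect score before and after, plus whether they still
# endorse short-term AI x-risk afterward. None of these numbers are real.

from statistics import mean

participants = [
    {"na_before": 32, "na_after": 24, "still_believes": True},
    {"na_before": 28, "na_after": 27, "still_believes": True},
    {"na_before": 35, "na_after": 22, "still_believes": False},
    {"na_before": 30, "na_after": 25, "still_believes": True},
]

# Average shift in negative affect (negative = feeling better on this subscale).
mood_change = mean(p["na_after"] - p["na_before"] for p in participants)
# Fraction whose x-risk belief survived the practice.
belief_retention = mean(p["still_believes"] for p in participants)

print(f"Mean change in PANAS negative affect: {mood_change:+.1f}")
print(f"Fraction still endorsing x-risk: {belief_retention:.0%}")
```

A real study would also want a control group and a preregistered analysis, to separate a genuine effect of the practice from regression to the mean and selection effects.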
I love this direction of inquiry. It’s tricky to get right because of lots of confounds. But I think something like it should be doable.
I love that this is a space where people care about this stuff.
I’m poorly suited to actually carry out these examinations on my own. But if someone does carry them out, or if someone wants to lead the charge and would like me to help design the study, I’d love to hear about it!
The worlds you want to distinguish are those with short-term doom and those without. Subconscious terror is the pesky confounder, not the thing you (in the end) want to distinguish. And it can be present or absent in either of the worlds you really want to distinguish: you can have doom+terror, nodoom+terror, doom+noterror, or nodoom+noterror. In the doom+terror world, distinguishing doom from terror can become a false dichotomy if the issue with the arguments isn’t formulated more carefully than that.
So the goal should be to figure out how to control for terror (if it’s indeed an important factor), and an argument shouldn’t be “distinguishing between doom and terror”; it should still be distinguishing between doom and no (short-term) doom, possibly accounting for the terror factor in the process.
Analyzing from the outside, I agree.
The pesky thing is that we can’t fully analyze it from the outside. The analysis itself can be colored by a terror generator that has nothing to do with the objective situation.
So if there’s reason to think there’s a subconscious distortion happening to our collective reasoning, distinguishing between doom and nodoom might functionally have sorting out terror as a prerequisite. Which sucks in terms of “If there’s doom then we don’t want to waste time working on things that aren’t related.” But if we literally cannot tell what’s real due to distortions in perception, then sorting out those perception errors becomes the top priority.
(I’m describing it in black-and-white framing to make the logic clear, not to assert that we literally cannot tell at all what’s going on.)
When you can’t figure something out, you need to act under uncertainty. The question is still doom vs. no short-term doom. Even if you conclude “terror”, that is only an argument for the uncertainty being unresolvable (with some class of arguments that would otherwise help), not an argument that doom has been ruled out (there still needs to be some prior). The “doom vs. terror” framing doesn’t adequately capture this.
Since 5-20% doom within 10 years is a relatively popular position, mixing in more nodoom because terror made certain doom vs. nodoom arguments useless doesn’t change this state of uncertainty too much; the decision-relevant implications remain about the same. That is, it’s still worth working on a mixture of short-term and longer-term projects, possibly even very long-term ones that almost inevitably won’t be fruitful before takeoff-capable AGI (because we might use the head start to prompt the AGIs into completing such projects before takeoff).
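As a toy illustration of why the numbers barely move (every figure here is hypothetical, not anyone’s actual credence): suppose you were at 5% before the contested doom arguments, they moved you to 15%, and you then decide half of their evidential weight was terror-colored.

```python
# Toy illustration with invented numbers: partially discounting a class of doom
# arguments shifts the probability less than one might expect.

import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

prior = 0.05       # hypothetical credence before hearing the doom arguments
posterior = 0.15   # hypothetical credence after hearing them
discount = 0.5     # suppose half their evidential weight is "terror-colored"

argument_weight = logit(posterior) - logit(prior)          # contribution in log-odds
adjusted = sigmoid(logit(prior) + (1 - discount) * argument_weight)

print(f"Adjusted P(doom within 10 years): {adjusted:.0%}")  # ~9%
```

The adjusted estimate lands around 9%, still inside the 5-20% band, so the same mixed portfolio of short-term and long-term work remains sensible.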
There are precious few, if any, arguments of the form “But here’s a logical reason why we’re doomed!” that can distinguish between these two worlds.
Feels like that’s a motte-and-bailey?
Sure, if you believe that the claimed psychological effects are very, very strong, then that’s the case.
However, let’s suppose you don’t believe they’re quite that strong. Even if you believe that these effects are pretty strong, sufficiently strong arguments may still provide decent evidence.
In terms of why the evidence I provided is strong:
There’s a well-known phenomenon of “cranks”: non-experts who end up believing arguments that sound super persuasive to them but are obviously false to experts. The CAIS letter ruled that explanation out.
It’s well-known that it’s easy to construct theoretical arguments that sound extremely persuasive but bear no relation to how things work in practice. Having a degree of empirical evidence for some of the core claims greatly weakens that line of objection.
So it’s not just that these are strong arguments. It’s that these are arguments that you might expect to provide some signal even if you thought the claimed effect was strong, but not overwhelmingly so.
I really don’t think so, but I’m not sure why you’re saying that, so maybe I’m missing something. If I keep doing something that looks to you like a motte-and-bailey, could you point out the specific structure? Like, what looks like my motte and what looks like my bailey?
So it’s not just that these are strong arguments. It’s that these are arguments that you might expect to provide some signal even if you thought the claimed effect was strong, but not overwhelmingly so.
Sure, but arguing that doom is real does nothing to say what proportion of the doom models’ spread is due to something akin to the trauma model. And that matters because… well, there’s a different motte-and-bailey structure that can go “OMG it’s SO BAD… but look, here’s the very reasonable argument for how it’s going to be this particular shape of challenging for us… SO IT’S SUPER BAD!!!” It’ll tend to warp how we see those otherwise reasonable arguments, and which ones get emphasized, and how.
Listing more such arguments, or illustrating how compelling they are, doesn’t do anything to suggest how much this warping phenomenon is or isn’t happening.
I mean, by analogy: someone can have a panic attack due to being trapped in a burning building. They’re going to have a hard time doing anything about their situation while having a panic attack. Arguments that they really, truly are in a burning building don’t affect the truth of that point, no matter how compelling those arguments are.
The motte-and-bailey is between an overwhelmingly strong effect and a pretty strong effect.