It’s helpful to know that we were thinking about different questions, but, like
There is some fact-of-the-matter about what, in practice, Sable’s kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans.
[...]
It may not have a strong belief that it has any specific goals it wants to pursue, but it’s got some sense that there are some things it wants that humanity wouldn’t give it.
these are claims, albeit soft ones, about what kinds of goals arise, no?
Your FAQ argues theoretically (correctly) that the training data and score function alone don’t determine what AI systems aim for. But this doesn’t tell us we can’t predict anything about End Goals full stop: it just says the answer doesn’t follow directly from the training data.
The FAQ also assumes that AIs actually have “deep drives” but doesn’t explain where they come from or what they’re likely to be. This post discusses how they might arise and I am telling you that you can think about the mechanism you propose here to understand properties of the goals that are likely to arise as a result of it[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
Of course, if this mechanism ends up being not very important, we could get very different outcomes.
Yeah, the first paragraph is meant to allude to “there is some kind of fact of the matter” but not argue it’d be any particular thing.
This post discusses how they might arise and I am telling you that you can think about the mechanism you propose here to understand properties of the goals that are likely to arise as a result of it[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
Yeah, I agree there’s some obvious followup worth doing here.
I agree it’s possible to make informed guesses about what drives will evolve (apart from the convergent instrumental drives, which are more obvious), and that’s an important research question that should get tons of effort. (I think it’s not in the IABIED FAQ because IABIED is focused on the relatively “easy calls”, and this is just straight up a hard call that involves careful research with the epistemic grounding to avoid falling into various Cope Traps.)
But, one of the “easy calls” is that “it’ll probably be pretty surprising and weird.” Because, while maybe we could have a decently accurate science of sub-human and eventually slightly-superhuman AI, once the AI’s capabilities rise to Extremely Vastly Powerful, it will find ways of achieving its goals that aren’t remotely limited by any of the circumstances of its ‘ancestral environment.’
I don’t have immediate followup thoughts on “but how would we do the predicting?” but if you give me a bit more prompting on what directions you think are interesting I could riff on that.
IABIED says alignment is basically impossible
....no it doesn’t? Or, I’m not sure how liberal you’re being with the word “basically”, but, this just seems false to me.
Cope Traps
Come on, I’m not doing this to you
The substance of what I mean here is “there is a failure mode, exemplified by, say, the scientists studying insects and reproduction who predicted the insects would evolve to have fewer children when there weren’t enough resources, but what actually happened is they started eating the offspring of rival insects of their species.”
There will be a significant temptation to predict “what will the AI do?” kinda hoping/expecting particular kinds of outcomes*, instead of straightforwardly rolling the simulation forward.
I think it is totally possible to do a good job with this, but, it is a real job requirement to be able to think about it in a detached/unbiased way.
*which includes, if an AI pessimist were running the experiment, assuming the outcome is always bad, to be clear.