And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that’s important. I think it’s important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn’t mean it’s fine and dandy to keep scaling with no concern at all.
The reason my percentage is “only 5 to 15” is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.
(Hopefully this comment of mine clarifies; it feels kinda vague to me.)
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
No, I am in fact quite worried about the situation
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.
I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures
Would you outline your full argument for this and the reasoning/evidence backing that argument?
But I do think this is way too high of a bar.
To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs’ internals, until we have either an AGI we’ve empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).
Would you disagree? If yes, how so?