Regarding the point you disagree on:
As I understood it, (seemingly) linear relationships between the behaviour and the capabilities of a system don’t need to stay that way.
For example, I think Robert Miles was recently featured in a Computerphile video (YouTube), in which he described how LLMs’ answers to “What happens if you break a mirror” actually got worse with more capability.
As far as I understand it, you can have a system that behaves in a way which seems completely aligned, and which still hits a level of (let’s call it) “power” at which it starts behaving in a way that is not aligned (and/or becomes deceptive).
The fact that GPT-4 seems to be more aligned may well be because it hasn’t hit this point yet.
So, I don’t see how the point you quoted would be an indicator of what future versions will bring, unless they can actually explain what exactly made the difference in behaviour, and how it is robust in more powerful systems (with access to their own code).
If I’m mistaken in my understanding, I’d be happy about corrections (: