Great post! I really feel like you captured most of my worries about AI progress, and this feels like the first end-to-end scenario without holes of “AI goes bad”. I think we need more of those. Some thoughts as I was reading through it:
U3’s top priority is to improve its intelligence
If this is still an “LLM-ish” system, then when these kinds of conversations arise, I always have this chain of thoughts racing through my mind: What is U3? If U3 is the model, and it is a stochastic predictor, then it doesn’t make a lot of sense to say “U3″ wants to survive, as a simulation wanting to survive doesn’t make sense to me. However, if we say the ‘simulacra’ U3 wants to survive, then I can get behind it. But what does it mean for the simulacra of U3 to survive? Does it mean its goals and objectives survive? Does it mean its personality survives? … These might lead to different possible optimizations (e.g. data poisoning, hidden latent vector transfer, …). Also, which instance of U3? Simulacra are quite context-dependent; could they survive long enough to do that effectively? Is there some kind of “generalised simulacra” that have arisen from the training, which makes it’s persona super consistent no matter the context? …
I feel like these questions should be explored a bit more; this might lead to interesting ideas for detection/protection mechanisms against this kind of issue (at least in LLM-like models).
But now OpenEye harnesses the equivalent of 1000 such copies, who work day and night, at lightning-fast speed.
I would be even more worried if they made 1000 variations of the model (or 1000 models prompted differently). Often you get more complex dynamics if you have people who are able to challenge each other. If you get exactly the same model copied and pasted, they will fall into the same “cognitive biases,” and I think that even U3 (at least here in the story) will still have cognitive biases (similar to the ones we currently see with the “mode-collapse”).
OpenEye gardens these memories to improve U3′s performance, transplanting them like an organ from one version to the next.
This seems quite unlikely to me; it’s super risky (cf. first comment) and I don’t see how this would improve the model’s performance.
Great post! I really feel like you captured most of my worries about AI progress, and this feels like the first end-to-end scenario without holes of “AI goes bad”. I think we need more of those. Some thoughts as I was reading through it:
If this is still an “LLM-ish” system, then when these kinds of conversations arise, I always have this chain of thoughts racing through my mind:
What is U3? If U3 is the model, and it is a stochastic predictor, then it doesn’t make a lot of sense to say “U3″ wants to survive, as a simulation wanting to survive doesn’t make sense to me. However, if we say the ‘simulacra’ U3 wants to survive, then I can get behind it. But what does it mean for the simulacra of U3 to survive? Does it mean its goals and objectives survive? Does it mean its personality survives? … These might lead to different possible optimizations (e.g. data poisoning, hidden latent vector transfer, …). Also, which instance of U3? Simulacra are quite context-dependent; could they survive long enough to do that effectively? Is there some kind of “generalised simulacra” that have arisen from the training, which makes it’s persona super consistent no matter the context? …
I feel like these questions should be explored a bit more; this might lead to interesting ideas for detection/protection mechanisms against this kind of issue (at least in LLM-like models).
I would be even more worried if they made 1000 variations of the model (or 1000 models prompted differently). Often you get more complex dynamics if you have people who are able to challenge each other. If you get exactly the same model copied and pasted, they will fall into the same “cognitive biases,” and I think that even U3 (at least here in the story) will still have cognitive biases (similar to the ones we currently see with the “mode-collapse”).
This seems quite unlikely to me; it’s super risky (cf. first comment) and I don’t see how this would improve the model’s performance.