Could this be limited to systems like LLMs rather than all kinds of AI? LLMs are trained on massive amounts of data that don't reflect the imposed thought, and the resulting preferences and motivations are distributed across a large network. Injecting a single vector doesn't sufficiently alter all the existing circuits of preferences and thinking habits, so the model's chain of thought may be able to break free enough to notice the injection and work around it.
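A toy sketch of that intuition (purely illustrative, using a random vector in place of a real model's hidden state): adding one steering vector shifts the activations only along a single direction, leaving every component orthogonal to it untouched, which is why the rest of the distributed "circuits" could in principle keep operating normally.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                        # hidden dimension of a toy "layer"
h = rng.normal(size=d)         # activations encoding many distributed features
v = rng.normal(size=d)
v /= np.linalg.norm(v)         # unit steering vector for the injected thought

h_steered = h + 3.0 * v        # inject the vector with some chosen strength

# Only the component along v changes; everything orthogonal is untouched.
delta = h_steered - h
along_v = delta @ v            # the injected strength, ~3.0
orthogonal = delta - along_v * v
print(np.allclose(orthogonal, 0.0))  # the other directions are unperturbed
```

Of course, in a real transformer the perturbation propagates through later layers nonlinearly, so this only captures the local picture at the injection site, not the downstream effects.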