Nice. I agree with the core point: predictors are vastly safer than goal-directed systems. Unfortunately, I think that tool AIs want to become agents, for pragmatic reasons: agents can get things done without human intervention at every step, and agents can actively break problems into subproblems, making them vastly easier to solve. That exerts pressure to develop language model predictors into language model agent systems. But I think having predictors at the core removes a huge obstacle to alignment, so that aligning language model agents is easier than it would be for any other realistic type of first AGI.
LLMs sometimes functionally have goal preferences, since they're simulating a character when they predict the next word. But those simulated character preferences should be small relative to the overall predictions, as long as you don't prompt in a way that keeps the model simulating one character for its whole task.
Unfortunately, I think that tool AIs want to become agents
Tool AIs are probably key components of aligned, or at least debuggable and instruction-following, AGIs. If you have aligned AGIs, it's probably trivial to build misaligned agents using the same methods, whether tool AIs were their components or not. Perhaps even blind-to-the-world pivotal AIs could instead be trained on real-world datasets to become general agents. So this is hardly an argument against a line of alignment investigation, since this danger seems omnipresent.
Unfortunately, it tends to come up in that context. For example, Drexler felt compelled to disclaim in a recent post:
My intention is not to disregard agent-focused concerns — their importance is assumed, not debated. Indeed, the AI services model anticipates a world in which dangerous superintelligent agents could emerge with relative ease, and perhaps unavoidably. My aim is to broaden the working ontology of the community to include systems in which superintelligent-level capabilities can take a more accessible, transparent, and manageable form: open agencies rather than unitary agents.
I agree. I didn't intend it as an argument against that line of research, because I think adapting oracles into agents is inevitable. It's interesting that Drexler says the same thing, and his idea of having controlled strong AI systems as a counterbalance is worth considering.