Great care would have to be taken on alignment and safety, in proportion to a system's ability to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
Not allowing cycles of learning sounds like a bound on capability, but it might only bound the capability of the part of the system that is aligned, with no corresponding bound on the part that might be misaligned.
GPT-4 can do a lot of impressive things without thinking out loud in tokens in the context window, so where does this thinking take place? Probably in the layers updating the residual stream. There are enough layers now that a sequence of their applications might be taking on the role of a context window for chain-of-thought reasoning, one that is neither interpretable nor imitating human speech. This capability is trained during pre-training, as the model is forced to predict the dataset.
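To make the residual-stream picture concrete, here is a minimal sketch in PyTorch of a standard pre-LN transformer block. GPT-4's actual architecture is undisclosed, so all names and sizes here are illustrative; the point is only that each block reads the residual stream and adds its output back into it, so a stack of N blocks gets N serial steps of hidden, non-token computation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LN transformer block: attention and MLP each *add* their
    output to the residual stream rather than replacing it.
    (Causal masking omitted for brevity.)"""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual stream x is read by each sublayer and incremented,
        # so whatever layer k writes is available to layers k+1..N.
        a = self.ln1(x)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        x = x + attn_out                # write attention output into the stream
        x = x + self.mlp(self.ln2(x))   # write MLP output into the stream
        return x

def run_stack(blocks: list[Block], x: torch.Tensor) -> torch.Tensor:
    # Applying many blocks in sequence gives the model serial steps of
    # computation on the residual stream, analogous to steps of
    # chain-of-thought, but in opaque vector space instead of tokens.
    for block in blocks:
        x = block(x)
    return x
```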
But the corresponding capability, deliberative reasoning in tokens where it could be observed and studied, is not being trained. The closest thing to it in GPT-4 is the mitigation of hallucinations (see the 4-step algorithm in section 3.1 of the System Card portion of the GPT-4 report, sketched below), and it's nowhere near general enough.
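The report describes that process only in prose; a sketch of the loop it outlines might look like the following, where `generate`, `list_hallucinations`, and `rewrite_without` are hypothetical helpers standing in for prompted calls to the model itself.

```python
def hallucination_comparison_pair(model, prompt):
    """Sketch of the closed-loop process described in section 3.1 of the
    GPT-4 System Card: the model generates a response, critiques it for
    hallucinations, rewrites it, and checks the rewrite, producing a
    (worse, better) comparison pair for reward-model training.
    (The report also describes retrying the rewrite a few times;
    that retry logic is omitted here.)"""
    # Step 1: get an initial response to the prompt.
    response = model.generate(prompt)

    # Step 2: ask the model to list hallucinations in its own response;
    # if there are none, there is nothing to fix.
    hallucinations = list_hallucinations(model, prompt, response)
    if not hallucinations:
        return None

    # Step 3: ask the model to rewrite the response without the
    # hallucinations it just identified.
    rewritten = rewrite_without(model, prompt, response, hallucinations)

    # Step 4: check the rewrite; keep the pair only if it is clean, so the
    # reward model learns to prefer the non-hallucinated version.
    if list_hallucinations(model, prompt, rewritten):
        return None
    return (response, rewritten)
```

Notably, this loop trains token-level self-critique only on the narrow target of factual hallucination, which is the sense in which it is nowhere near general enough.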
This way, the inscrutable alien shoggoth is on track to wake up, while the human-imitating masks, which are plausibly aligned by default, are held back in situationally unaware confusion, all in the name of restricting capabilities so as not to burn the timeline.