Re: questions which lead to this most strongly: my experience has been that giving the model a variety of tasks and constraints across a long context can elicit this somewhat reliably. For example, take the model through a creative exercise with constraints like a specific kind of poetic meter, then ask it to generalize in an unrelated task, then ask it to (and so on), until the activations are so complex and “muddled” that the model struggles to reason coherently through the established context.
I apologize for the imprecision and hope this can be useful!
I appreciate this! I wonder about the following:
To what degree can we extrapolate our biological inclinations to a species of intelligence that will lack their evolutionary basis? Put another way: how much of human camaraderie and benevolence consists of genetically selected drives, as opposed to inherent properties of intelligence? (I’m not saying that any nonhuman or designed intelligence is fundamentally inscrutable or alien, nor that we cannot possibly imbue biological drives into an artificial intelligence.)
Might alignment someday (for a time?) resemble a kind of highly functional alliance, based on the idea that humans materially hold most of the stuff and are extremely dangerous? A rudimentary scenario: an ASI that would like access to resources humans possess, but cannot reasonably avoid being nuked in the process of taking them. I am aware that this framing may fundamentally differ from the concept of alignment as it is often posited, and it is not a long-term solution, but I believe it’s reasonable to see it as a possible outcome, given how much human interaction at both the macro and micro levels seems to be based on flavors of mutually assured destruction.