I wouldn’t describe my research as super theoretical, but it does involve me making arguments (although not formal proofs) about why I expect my plans to work even as training continues.
For example, it seems like in some non-formalized sense, a human is trivially aligned to themselves. The more similar your AI’s behavior is to a particular human’s behavior, the more aligned it is to that human. If you want to align the AI to a group of humans (e.g. all of humanity), you might want to start by emulating a good approximation of all humans and bootstrapping from there. I’m not working on this aspect of the problem directly—I’m just assuming that the LLM is pretty humanlike to begin with—but I wrote a post talking about similar ideas.