I’m approaching it from a “theoretical” perspective[1], so I want to know how “humanlike reasoning” could be defined (beyond “here’s some trusted model which somehow imitates human judgement”), or why human-approved capability gain preserves alignment (what’s the core reason? what makes human judgement good?). My biggest worry, then, is not that the assumption might be false, but that it’s barely understood at the theoretical level.
What are your research interests? Are you interested in defining what “explanation” means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I’m kinda chasing the pipe dream of glass-box AI.
I’m contrasting theoretical and empirical approaches.
Empirical—“this is likely to work, based on evidence”. Theoretical—“this should work, based on math / logic / philosophy”.
Empirical—“if we can operationalize X for making experiments, we don’t need to look for a deeper definition”. Theoretical—“we need to look for a deeper definition anyway”.
I wouldn’t describe my research as super theoretical, but it does involve me making arguments (although not formal proofs) about why I expect my plans to work even as training continues.
For example, it seems like in some non-formalized sense, a human is trivially aligned to themselves. The more similar your AI’s behavior is to a particular human’s behavior, the more aligned it is to that human. If you want to align the AI to a group of humans (e.g. all of humanity), you might want to start by emulating a good approximation of all humans and bootstrapping from there. I’m not working on this aspect of the problem directly—I’m just assuming that the LLM is pretty humanlike to begin with—but I wrote a post talking about similar ideas.
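To make the “more behaviorally similar ⇒ more aligned” intuition slightly more concrete, here’s a toy sketch (my own illustration, not a formal proposal from either of us; the names `Policy` and `behavioral_distance` are made up for the example). It treats the AI and the human as policies mapping contexts to action distributions, and uses the average divergence between those distributions as a crude “behavioral distance”, with smaller distance read as closer alignment to that human.

```python
# Toy illustration only: "alignment to a human" proxied by behavioral similarity,
# measured as average KL divergence between action distributions per context.
# All names here are hypothetical, not from any established framework.

import math

# A "policy" is just a map from context -> probability distribution over actions.
Policy = dict[str, dict[str, float]]

def kl(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over a shared discrete action set."""
    return sum(p[a] * math.log(p[a] / max(q.get(a, 0.0), eps)) for a in p if p[a] > 0)

def behavioral_distance(ai: Policy, human: Policy) -> float:
    """Average divergence between the human's and the AI's action distributions."""
    contexts = ai.keys() & human.keys()
    return sum(kl(human[c], ai[c]) for c in contexts) / max(len(contexts), 1)

# Lower distance = more humanlike behavior = (on this crude reading) more aligned.
human = {"asked_for_help": {"help": 0.9, "refuse": 0.1}}
ai    = {"asked_for_help": {"help": 0.8, "refuse": 0.2}}
print(behavioral_distance(ai, human))  # small value -> behaviorally close
```

Obviously this sidesteps the hard part (what counts as the “same context”, which behaviors matter, and whether behavioral similarity captures intent at all), but it’s the kind of definition-shaped object the theoretical framing is asking for.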