In the near future, AI models might become extremely capable in math, programming, and other formal fields, but not as capable at messy real-world tasks.
Believing this affects the priorities of today’s alignment research: our efforts shouldn’t go into proving hard mathematical results, but into formalizing the general/philosophical ideas in mathematical language.
The first step is to translate everything into precise language; the second step is to put all the results into a completely formal system, like Lean with mathlib. For example, we might come up with a formal definition of what an “aligned agent” means, and then give it to an intelligent system that outputs a 50-page formal Lean construction of such an object. If we believe our mathematical axioms are correct (and that the library is well-implemented), then we should be able to trust the result.
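A minimal sketch of what that hand-off could look like, assuming Lean 4; every name here (Agent, value, Aligned) is a toy placeholder of my own, not a mathlib declaration, and getting the real definitions right is exactly the hard part:

```lean
-- Toy sketch (hypothetical names, not mathlib): we write the definitions ourselves,
-- and the only thing handed to the AI system is the proof obligation at the end.

structure Agent where
  policy : Nat → Nat      -- stand-in for an observation → action map

-- Black-box stand-in for the human value function discussed below.
opaque value : Agent → Nat

-- Deliberately naive placeholder for what "aligned" could mean formally.
def Aligned (a : Agent) : Prop :=
  ∀ b : Agent, value b ≤ value a

-- The statement we hand off; the system returns a construction (a proof term)
-- that Lean's kernel can check independently of how it was produced.
theorem exists_aligned_agent : ∃ a : Agent, Aligned a := by
  sorry
```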
What we shouldn’t hand off to the system is the task of formalizing the problem itself. If we only have a rough idea of what we’re looking for, an adversarial system could come up with its own formalization, one that sounds good to us but has hidden subtleties that ruin the theory.
(Also, this means we need to be very good at eliminating inconsistencies in Lean’s proof system and in our own axioms; otherwise the system could prove anything it wants by first deriving a contradiction!)
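This is just the principle of explosion; a two-line Lean example of why a single inconsistency would be fatal:

```lean
-- Principle of explosion: from one proof of `False`, every proposition follows,
-- so a single inconsistency would let the system "prove" whatever it wants.
example (P : Prop) (h : False) : P := h.elim
```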
we might come up with a formal definition of what an “aligned agent” means
I believe this is the difficult part. More precisely, describing what the agent is aligned with. We can start by treating human CEV as a black box, and specify mathematically what it means to be aligned with some abstract f(x). But then the problem becomes specifying f(x) itself.
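As a toy illustration of that black-box step (my own notation, nothing standard), the opaque-f definition can fit on one line:

```latex
% Toy definition: f is an opaque stand-in for human CEV, evaluated on the
% outcomes x that a policy \pi induces.
\mathrm{Aligned}(\pi, f)
  \iff
  \pi \in \arg\max_{\pi'} \; \mathbb{E}_{x \sim \pi'}\left[ f(x) \right]
```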
A great consequence of smart AIs: every textbook with exercises is now a textbook with solutions. If you get stuck on a hard exercise, you no longer need to consult friends or ask Math Stack Exchange; you just let the model answer it (or even ask for partial hints instead of a full solution).
Textbook writers should now also find it easy to generate a document with hints and solutions for all the exercises in the textbook. Someone should probably automate this?
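A rough sketch of that automation, assuming the OpenAI Python SDK and a plain list of exercise strings; the model name, prompt, and output format are placeholder choices, not a recommendation:

```python
# Rough sketch: generate a hints-and-solutions companion for a list of exercises.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY in the
# environment; model name, prompt, and output layout are placeholder choices.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are writing the solutions manual for a textbook.\n"
    "For the exercise below, give two progressively stronger hints, "
    "then a complete solution.\n\nExercise:\n{exercise}"
)

def hints_and_solution(exercise: str) -> str:
    # One model call per exercise; returns the hints + solution as plain text.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(exercise=exercise)}],
    )
    return response.choices[0].message.content

def build_companion(exercises: list[str], out_path: str = "solutions.md") -> None:
    # Write one section per exercise into a single companion document.
    with open(out_path, "w") as out:
        for i, exercise in enumerate(exercises, start=1):
            out.write(f"## Exercise {i}\n\n{hints_and_solution(exercise)}\n\n")
```

The generated solutions would still need a human spot-check, for exactly the reliability reasons raised further down the thread.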
In some ways this is a great loss for students, since it’s a new source of temptation (like the internet and Stack Exchange) to just look at the answers instead of actually eating your vegetables. It seems like a harder environment these days in which to Build Character.
There’s a question at each education level of whether having a Magical Answer Tool makes your habits better or worse.
Clearly elementary schoolers need to learn how to solve problems on their own, and they probably can’t be trusted with the Magical Answer Tool for that purpose.
Professional researchers are skilled enough at solving problems that they’ll find good use for the Magical Answer Tool even when (especially when!) studying new fields they don’t yet have intuition for.
So there’s some point between elementary school and research work at which this tool becomes more helpful than harmful.
I tend to think it’s at the undergrad level; once you enter university-level mathematics (or physics, philosophy, etc.), you’re expected to understand things on your own. I’m sure many will be tempted by the Magical Answer Tool to their own disadvantage, but honestly, this might be a Skill Issue.
Accessible/capable AI is also why teachers are going to have to stop grading on “getting the right answer” and start incorporating more “show your reasoning” questions on exams taken without access to AI. Education will have to adapt to this new technology, like it’s had to adapt to every new technology.
To be honest, done correctly this may actually be a net positive if we stop optimizing learners only for correct answers and instead focus on the actual process of learning.
I could see a class where students are encouraged to explore a topic with AI and have to submit their transcript as part of the assignment; their prompts could then be reviewed (along with the AI’s answers, to verify that no mistakes snuck in). That could give a lot of insight into the way a student approached the topic and show where their gaps are. Not saying this is the ultimate solution, but it does seem better than throwing up one’s hands in resignation.
I think we can go one step further: (with sufficiently smart AIs) every topic explanation is now a textbook with exercises.
I’m not so sure I agree with that. LLMs are very useful for tasks that are much quicker for me to verify than to do myself (like writing boilerplate code with a library I’ll only use once, which I don’t want to look up the documentation for), but if I don’t know how to do something, I’d be very wary of using an LLM, on the basis that it might make mistakes that I don’t catch.
This is an especially strong concern when learning something. You might even get to a correct answer, but through a ‘rule’ that was hallucinated or may not always apply.
This is also a skill issue: learning to vet your sources properly and analyze what someone has told you. If teachers are not already teaching students to do this with the books and websites they consult, then yes, an LLM can make the problem worse, but it’s already pretty bad.
I think it’s a bit more nuanced than that. Going through a list of human sources of various kinds, getting a sense of who wrote them and how reliable they might be on any issue, and assembling a complete picture of what you’re researching from that is one thing. Going through an LLM’s response to a question and determining which parts are true, which parts are hallucinations, and which parts are almost true but contain subtle mistakes that no human would make in quite the same way is a very different matter. The former is a biologically-ingrained skill that’s useful everywhere, and the latter much less so.
The best example I can give is the experience of debugging LLM-written code versus debugging the code of a human. The human code might have errors in it, but these errors are the result of a flawed set of assumptions that can generally be identified from context. You look for a mistake, and there’s a ‘well’ of incorrect reasoning surrounding it that you can follow to the crux of the problem. With LLM code, this isn’t the case. It’s very easy for the model to match the surface-level stylistic conventions of good code, and this often results in very unpredictable mistakes that you’d never see a human dev make. Calls to imaginary functions, dummied out functionality that’s not marked as such, and so on.
There’s a time and place for everything, but I think the set of scenarios in which I’m learning an entirely new process and not being extremely selective about the quality of my sources is quite small.
If students are seeing an LLM as a source at all in the context of, e.g., writing a paper or solving a problem, then they’ve already made the critical mistake.
Can the LLM cite its own sources, such that you can re-generate the relevant findings from those sources without referencing anything else the LLM claimed? Great. If not, then from a research POV its claims are no more authoritative than writing “My dad said...”
I am also very, very aware that our teachers, our schools, and our students are not at all set up to actually get people to learn the skills/mindset needed to do better, or to use LLMs well more generally, for many reasons. The problem is somewhat harder and vastly more important and urgent than it used to be before LLMs. I am not sure most people can achieve the necessary mindset and habits to get past this without a lot of societal and technological scaffolding we don’t have today, and I can only hope we’ll build that adequately.