If we keep stumbling into LLM-type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?
I’m not sure but I do think it’s very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:
Humans do a lot of bad philosophy and often can’t recognize good philosophy. (See the popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It’s possible that even solving metaphilosophy doesn’t help enough with this, if many people can’t recognize the solution as correct, but there’s at least a chance that the solution does look obviously correct to many people, especially if there aren’t already wrong solutions to compete with.)
What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user?
What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section “Replicate the trajectory with ML?” of “Some Thoughts on Metaphilosophy”, but I’m not sure how strong it is) or maybe it’s just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
Or maybe we do figure out that AI is worse at philosophy than at other tasks, after it has been built, but it’s too late to do anything with that knowledge (because who is going to tell the investors that they’ve lost their money because we don’t want to differentially decelerate philosophical progress by deploying the AI?).
Here’s another bullet point to add to the list:
It is generally understood now that ethics is subjective, in the following technical sense: ‘what final goals you have’ is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it’ll probably end up with goals you don’t want. What if ethics isn’t the only such free parameter? Indeed, philosophers tell us that in the Bayesian framework your priors are subjective in this sense, and maybe your decision theory is too. Perhaps, therefore, what we consider “doing good/wise philosophy” is going to involve at least a few subjective elements, where what we want is for our AGIs to do philosophy (with respect to those elements) in the way that we would want and not in various other ways. That won’t happen by default; we need to have some mechanism to make it happen.
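To make the “priors are a free parameter” point concrete, here is a minimal sketch, not anything from the comment itself. It assumes a toy coin-flip setting with Beta priors; the function name, the two priors, and the 7-heads/3-tails data are all made up for illustration. Both agents update correctly on the same evidence and still disagree, purely because of where they started.

```python
from fractions import Fraction


def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    Uses the standard Beta-Binomial update: observed counts are simply
    added to the prior pseudo-counts.
    """
    return Fraction(prior_heads + heads,
                    prior_heads + prior_tails + heads + tails)


# Hypothetical shared evidence: 7 heads and 3 tails.
heads, tails = 7, 3

# Two hypothetical agents that differ only in their prior.
agent_uniform = posterior_mean(1, 1, heads, tails)  # Beta(1, 1): no initial lean
agent_skeptic = posterior_mean(1, 9, heads, tails)  # Beta(1, 9): expects mostly tails

print(agent_uniform)  # 2/3
print(agent_skeptic)  # 2/5
```

Nothing inside the update rule says which of the two starting points is the “right” one, which is the sense of “subjective” being used above.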
I don’t say it’s not risky. The question is more: what’s the difference between doing philosophy and other intellectual tasks?
Here’s one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts about when to apply them or not. And some luck, in that some heuristics will happen to generalize to the stuff of our mind, but others won’t.
If this is a true picture, then running far ahead with philosophy is just inherently risky. The further you step away from heuristics that have been tested in reality, and their area of applicability, the bigger your error will be.
Does this make sense?
Do you have any examples that could illustrate your theory?
It doesn’t seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what “domains with feedback”?
Maybe I used a heuristic like “computer science is cool, let’s try to apply it to philosophical problems” but if the heuristics are this coarse-grained, it doesn’t seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?
Maybe one example is the idea of a Dutch book. It comes originally from real world situations (sports betting and so on) and then we apply it to rationality in the abstract.
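For anyone who hasn’t seen the Dutch book argument spelled out, here is a rough sketch of the betting version. The credences, the stake, and the function names are hypothetical, chosen only for illustration. The agent is assumed to buy any ticket at the price its own credence says is fair; if its credences in A and not-A sum to more than 1, the bookie profits no matter which one happens.

```python
from fractions import Fraction


def fair_price(credence, stake=1):
    """Price the agent regards as fair for a ticket paying `stake` if the event occurs."""
    return credence * stake


def bookie_profit(credence_a, credence_not_a, stake=1):
    """Bookie's guaranteed profit from selling one ticket on A and one on not-A.

    Exactly one of the two tickets pays out, so the bookie pays `stake`
    whichever way the event goes, and keeps whatever was collected beyond that.
    """
    collected = fair_price(credence_a, stake) + fair_price(credence_not_a, stake)
    return collected - stake


# Hypothetical incoherent agent: P(A) = 3/5 and P(not A) = 3/5, summing to 6/5.
incoherent = bookie_profit(Fraction(3, 5), Fraction(3, 5))

# Coherent agent: P(A) = 3/5 and P(not A) = 2/5, summing to exactly 1.
coherent = bookie_profit(Fraction(3, 5), Fraction(2, 5))

print(incoherent)  # 1/5
print(coherent)    # 0
```

This only covers the case where the credences sum to more than 1; when they sum to less than 1 the bookie does the mirror image and buys the tickets from the agent instead.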
Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he’d draw an analogy with something closer to experience. For example, “Is the nature of virtue different for men and for women?”—“Well, the nature of strength isn’t that much different between men and women, likewise the nature of health, so maybe virtue works the same way.” Obviously this way of reasoning can easily go wrong, but I think it’s also pretty indicative of how people do philosophy.
Can’t all of these concerns be reduced to a subset of the intent-alignment problem? If I tell the AI to “maximize ethical goodness” and it instead decides to “implement plans that sound maximally good to the user” or “maximize my current guess of what the user meant by ethical goodness according to my possibly-bad philosophy,” that is different from what I intended, and thus the AI is unaligned.
If the AI starts off with some bad philosophy ideas just because it’s relatively unskilled in philosophy vs science, we can expect that 1) it will try very hard to get better at philosophy so that it can understand “what did the user mean by ‘maximize ethical goodness,’” and 2) it will try to preserve option value in the meantime so not much will be lost if its first guess was wrong. This assumes some base level of competence on the AI’s part, but if it can do groundbreaking science research, surely it can think of those two things (or we just tell it).