I find that I’m not very uncertain about where values come from. Although the exact details of the mechanisms in complex systems like humans remain murky, it seems pretty clear to me that we already have the answer from cybernetics: values are the result of how we’re “wired” up into feedback loops. There’s perhaps the physics question of why the universe is full of feedback loops (and the metaphysics question of why our universe has the physics it does), but given that the universe is full of feedback loops, values seem a natural consequence of that fact.
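To make that concrete, here is a minimal sketch (my own toy illustration, with made-up names and numbers) of the cybernetic picture: a “value” as the set point of a negative feedback loop, where behavior acts to shrink the error between the observed state and that set point.

```python
# Toy sketch: a "value" as the set point of a negative feedback loop.
# All names and numbers here are illustrative, not from any source.

def feedback_step(state: float, set_point: float, gain: float = 0.1) -> float:
    """One step of negative feedback: act to reduce the error between
    the observed state and the set point (the 'value')."""
    error = set_point - state
    return state + gain * error  # corrective action proportional to error

state = 15.0
for _ in range(50):
    state = feedback_step(state, set_point=20.0)
print(round(state, 2))  # the loop settles near its set point, ~20.0
```

On this picture, what we call a value is just the reference signal a loop defends; more complex values would be built out of many such loops wired together.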
Agree somewhat, though I think lack of confusion (already knowing) can go too far as well. Wanted to add that, per Quine and as later analyzed by Nozick, we seem to be homeostatic envelope extenders. That is, we start with maintaining homeostasis (which gets complicated in social species because of cognitive arms races), and then we add the ability to reason about things far from their original context in time and space, trying to extend our homeostatic abilities to new contexts and increase their robustness across arbitrary timescales, locations, and conditions.
And it seems arthrodiatomic (cutting across joints, i.e. non-joint-carving) to describe the envelope-extension process itself as an instance of homeostasis.

Can you give pointers to where Quine and Nozick talk about this?

I mostly got this from Nozick’s final book, Invariances.
This is a non-answer, and I wish you’d notice on your own that it’s a non-answer. From the dialogue:
Really I want to know the shape of values as they sit in a mind. I want to know that because I want to make a mind that has weird-shaped values. Namely, Corrigibility.
So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don’t think so, but please correct me if you do. Assuming you don’t, I suggest that understanding what values are and where they come from in a more joint-carving way might help.
In other words, saying that, details aside, values are “the result of how we’re ‘wired’ up into feedback loops” is true enough, but it’s not an answer. It would be like saying “our plans are the result of how our neurons fire” or “the Linux operating system is the result of how electrons move through the wires in my computer”. It’s not false, it’s just not an answer to the question we were asking.
So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don’t think so, but please correct me if you do. Assuming you don’t, I suggest that understanding what values are and where they come from in a more joint-carving way might help.
Yes, understanding values better would be better. The case I’ve made elsewhere is that we can use cybernetics as the basis for this understanding. Hence my comment: if you don’t know where values come from, I can offer what I believe is a model of where values ultimately come from, one that gives a good basis for building up a more detailed model of values. Others are doing the same with compatible models, e.g. predictive processing.
I’ve not thought deeply about corrigibility recently, but my thinking on outer alignment more generally has been that, because Goodhart is robust, we cannot hope to get fully aligned AI by any method that relies on measuring alignment, which leaves us with building AI whose goals are already aligned with ours (it seems quite likely we’re going to bootstrap to AI that helps us build this, though, so work on imperfect systems seems worthwhile, but I’ll ignore that here). I expect a similar situation for building just corrigibility.
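As a toy illustration of why measurement-based methods fail here (my own example, with made-up functions and numbers): optimizing a proxy that correlates with the true goal over part of its range eventually drives the true goal down.

```python
# Toy Goodhart illustration: push a measured proxy up and watch the
# true goal peak and then collapse. Numbers are made up for illustration.

def true_goal(x: float) -> float:
    return -(x - 3.0) ** 2   # what we actually care about, peaked at x = 3

def proxy(x: float) -> float:
    return x                 # a measurement correlated with true_goal for x < 3

x = 0.0
for _ in range(6):
    x += 1.0                 # "optimize" by pushing the proxy up
    print(f"proxy={proxy(x):.0f}  true_goal={true_goal(x):.0f}")
# the proxy rises monotonically while the true goal peaks at x = 3, then falls
```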
So to build a corrigible AI, my model says we need to find the configuration of negative feedback circuits that implements a corrigible process. That doesn’t constrain the search space a lot, but it constrains it some, and it makes clear that what we have is an engineering challenge rather than a theory challenge. I see this as advancing the question from “where do values come from?” to “how do I build a thing out of feedback circuits that has the values I want it to have?”.
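To gesture at what that engineering target might look like, here is a minimal sketch under my own framing (the class and all names are hypothetical): a corrigible process as a negative feedback circuit whose reference signal stays writable by the operator, so the system acts only to track the current reference, never to defend it.

```python
# Sketch of a corrigible process as a negative feedback circuit whose
# reference signal remains writable by the operator. Hypothetical names.

class CorrigibleController:
    def __init__(self, set_point: float, gain: float = 0.2):
        self.set_point = set_point
        self.gain = gain

    def correct(self, new_set_point: float) -> None:
        # The defining move: accept the operator's update unconditionally,
        # with no machinery for defending the old set point.
        self.set_point = new_set_point

    def step(self, state: float) -> float:
        # Ordinary negative feedback: act to shrink the current error.
        return state + self.gain * (self.set_point - state)

controller = CorrigibleController(set_point=10.0)
state = 0.0
for t in range(40):
    if t == 20:
        controller.correct(-5.0)  # the operator changes their mind mid-run
    state = controller.step(state)
print(round(state, 2))  # now tracking the *new* set point, near -5.0
```

The design choice doing the work is that correction is just another input to the circuit, not a disturbance the loop is wired to resist.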