It is very plausible [...] that we value internal states on the model, and we also receive negative reinforcement for model-world inconsistencies [...], resulting in learned preference not to lose correspondence between model and world
Generally correct; we learn to value good models because they are more useful than bad models. We want rewards; therefore we want good models; therefore we are interested in the world out there. (For a reductionist, there must be a mechanism explaining why and how we care about the world.)
Technically, sometimes the most correct model is not the most rewarded model. For example, it may be better to believe a lie and be socially rewarded by members of my tribe who share the belief than to hold a true belief that gets me killed by them. There may be other situations, not necessarily social, where perfect knowledge is out of reach and a better approximation lands in the “valley of bad rationality”.
it is unnecessary to define the values over real world (the alternatives work fine for e.g. finding imaginary cures for imaginary diseases which we make match real diseases) [...] there’s precisely the bit of AI architecture that has to be avoided.
In other words, make an AI that only cares about what is inside the box, and it will not try to get out of the box.
That assumes that you will feed the AI all the necessary data, and verify that the data is correct and complete, because the AI will be just as happy with any kind of data. If you give the AI incorrect information, the AI will not care, because it has no definition of “incorrect”; this holds even in situations where the AI is smarter than you and could have noticed an error that you missed. In other words, you are responsible for giving the AI the correct model, and the AI will not help you with this, because it does not care about the correctness of the model.
You put it backwards… making an AI that cares about truly real stuff as its prime drive is likely impossible, and we certainly don’t know how to do it, nor do we need to. edit: i.e., you don’t have to sit and work and work and work to find out how to make some positronic mind not care about the real world. You get this for free, simply by omitting some mission-impossible work. Specifying what you want, in some form, is unavoidable.
Regarding verification, you can have the AI search for the code that best predicts the input data; then, if you are falsifying the data, the winning code will include a model of your falsifications.
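A toy sketch of the point above, with everything hypothetical: the underlying process, the tampering function, and the tiny linear hypothesis class are all invented for illustration, not taken from the comment. The model selected by pure predictive fit ends up describing the falsified stream, absorbing the tampering into its parameters.

```python
# Toy illustration: pick, from a small hypothesis class, the model that best
# predicts the observed input stream. If every observation is systematically
# falsified, the best predictor models the falsification, not the raw world.

def true_process(t):
    return 2 * t  # hypothetical underlying world: value doubles each step

def falsify(x):
    return x + 5  # hypothetical systematic tampering applied to each reading

observations = [falsify(true_process(t)) for t in range(10)]

# A tiny hypothesis class: linear models x(t) = a*t + b.
hypotheses = [(a, b) for a in range(5) for b in range(10)]

def error(h, data):
    a, b = h
    return sum((a * t + b - x) ** 2 for t, x in enumerate(data))

best = min(hypotheses, key=lambda h: error(h, observations))
print(best)  # the winning model has absorbed the tampering offset
```

Here `best` comes out as `(2, 5)`: slope 2 from the true dynamics, intercept 5 from the tampering, so the falsification is part of the learned model rather than something the predictor can flag as “incorrect”.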