The problem with the maths is that it does not correlate ‘values’ with any real-world observable. You give all objects a property and say that the property is distributed according to a simplicity prior. You have not yet specified how these ‘values’ things relate to any real-world phenomenon. Under this model, you could never see any evidence that humans don’t ‘value’ maximizing paperclips.
To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don’t have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, instead of simulating every transistor, you can think in terms of folders and files.
I can substitute words in the ‘proof’ that humans don’t have values, and get a proof that computers don’t have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate description into a confidence that the two are uncorrelated.
Making a somewhat naive and not formally specified assumption along the lines of, “the real action taken optimizes human values better than most possible actions” will get you a meaningful but not perfect definition of ‘values’. You still need to say exactly what a “possible action” is.
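As a toy illustration of that kind of assumption, here is a minimal Python sketch. Everything in it is made up for the example: the list of actions, the two candidate value functions, and the add-one likelihood are all my assumptions, and hard-coding the action list is exactly the step that dodges the question of what a “possible action” is. The point is only that once ‘values’ are tied to an observable (the action taken), evidence can move you away from the paperclip hypothesis.

```python
from fractions import Fraction

# Hypothetical toy world: each possible action maps to an outcome
# (paperclips made, hours of rest). The list itself is an assumption.
ACTIONS = {
    "build_paperclip_factory": (100, 0),
    "make_one_paperclip": (1, 2),
    "take_a_nap": (0, 8),
    "go_for_a_walk": (0, 5),
}

# Hypothetical candidate 'values', each scoring an outcome.
CANDIDATE_VALUES = {
    "values_paperclips": lambda outcome: outcome[0],
    "values_rest": lambda outcome: outcome[1],
}


def actions_beaten(value_fn, observed):
    """Count the other actions that the observed action strictly outscores."""
    observed_score = value_fn(ACTIONS[observed])
    return sum(
        1
        for name, outcome in ACTIONS.items()
        if name != observed and value_fn(outcome) < observed_score
    )


def posterior(observed, prior):
    """Bayes update with likelihood proportional to 1 + actions_beaten.

    The add-one term is an assumed smoothing so that no hypothesis
    is ever ruled out completely by a single observation.
    """
    weights = {
        name: prior[name] * (1 + actions_beaten(fn, observed))
        for name, fn in CANDIDATE_VALUES.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


uniform_prior = {name: Fraction(1, 2) for name in CANDIDATE_VALUES}
nap_posterior = posterior("take_a_nap", uniform_prior)
```

Seeing the human take a nap shifts weight from `values_paperclips` to `values_rest`, which is the sort of evidence the unanchored model could never register.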
Making a somewhat naive and not formally specified assumption along the lines of, “the files are what you see when you click on the file viewer” will get you a meaningful but not perfect definition of ‘files’. You still need to say exactly what a “click” is, and how you translate a pattern of photons into a ‘file’.
We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.
I like this analogy. Probably not best to put too much weight on it, but it has some insights.