[Question] Why doesn’t the presence of log-loss for probabilistic models (e.g. sequence prediction) imply that any utility function capable of producing a “fairly capable” agent will have at least some non-negligible fraction of overlap with human values?

Most successful AI models use similar loss functions: typically some kind of log-loss over a conditional probability distribution.

Given that these loss functions are also utility functions (simply multiplied by -1), any model we train that has some measure of capability must also factor what is "true" into its utility function. In other words, there is already a necessary overlap between what the model considers "good" (utility) and what it considers "true" (correct predictions about its environment).
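A minimal Python sketch of the point above, assuming a discrete outcome space and a hypothetical "true" distribution `true_p`. Utility is defined as the negated log-loss, and the expected utility of a predictive distribution `q` under the true distribution `p` is highest when `q` equals `p`. This follows from the log score being a strictly proper scoring rule (a consequence of Gibbs' inequality). All names and numbers here are illustrative, not taken from any particular model.

```python
import math


def log_loss(q, outcome):
    """Negative log-probability that the predictive distribution q assigns to the observed outcome."""
    return -math.log(q[outcome])


def utility(q, outcome):
    """Utility is just the negated loss, i.e. the log-probability of the outcome."""
    return -log_loss(q, outcome)


def expected_utility(p, q):
    """Expected utility of predicting q when outcomes are actually drawn from p."""
    return sum(p_i * utility(q, i) for i, p_i in enumerate(p) if p_i > 0)


# Hypothetical true distribution over three outcomes (illustrative only).
true_p = [0.7, 0.2, 0.1]

# Hypothetical candidate predictors: one matches the truth, two do not.
candidates = {
    "truthful": [0.7, 0.2, 0.1],
    "overconfident": [0.9, 0.05, 0.05],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
}

scores = {name: expected_utility(true_p, q) for name, q in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:>13}: expected utility = {score:.4f}")
print("highest expected utility:", best)
```

Under this toy setup, maximizing the (negated log-loss) utility and predicting the true distribution coincide, which is the "overlap between good and true" the question relies on.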

This is, of course, similar to AIXI.

This is often called "instrumental convergence": AIs with very different overall utility functions will have convergent sub-goals, such as modeling the world accurately.

I think it's fairly obvious that AIs must have a significant "chunk" of their overall utility devoted to things that overlap substantially with human values, such as being able to make correct predictions about the world or universe.

My question is: why do we expect the portion of an agent's utility function that is not devoted to something overlapping with humans (or with any other intelligent agent) to result in dramatically different outcomes? In other words, why expect the non-overlapping factors of the utility function to dominate?

Consider that if that is the case, then agents which try to predict the actions of other agents will find this more difficult than they otherwise would. Moreover, predicting other agents' behavior is itself necessarily part of the overlapping (predictive-modeling) factors of their utility functions.
