Bayesian Decision Theory tag - LessWrong 2.0 viewer
https://www.greaterwrong.com/
Bayesian Probability is for things that are Space-like Separated from You by Scott Garrabrant
https://www.greaterwrong.com/posts/FvcyMMaJKhYibtFDD/bayesian-probability-is-for-things-that-are-space-like
<p>First, I should explain what I mean by space-like separated from you. Imagine a world that looks like a <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian network</a>, and imagine that you are a node in that Bayesian network. If there is a path from you to another node following edges in the network, I will say that node is time-like separated from you, and in your future. If there is a path from another node to you, I will say that node is time-like separated from you, and in your past. Otherwise, I will say that the node is space-like separated from you.</p><p>Nodes in your past can be thought of as things that you observe. When you think about physics, it sure does seem like there are a lot of things in your past that you do not observe, but I am not thinking about physics-time; I am thinking about logical-time. If something is in your past, but has no effect on what algorithm you are running or on what observations you get, then it might as well be considered space-like separated from you. If you compute how everything in the universe evaluates, the space-like separated things are the things that can be evaluated either before or after you, since their output does not change yours or vice versa. If you partially observe a fact, then I want to say you can decompose that fact into the part that you observed and the part that you didn’t, and say that the part you observed is in your past, while the part you didn’t observe is space-like separated from you. (Whether or not you actually can decompose things like this is complicated, and related to whether or not you can use the tickle defense in the smoking lesion problem.)</p><p>Nodes in your future can be thought of as things that you control. These are not always things that you want to control. For example, you control the output of “You assign probability less than <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> to this sentence,” but perhaps you wish you didn’t. 
Again, if you partially control a fact, I want to say that (maybe) you can break that fact into multiple nodes, some of which you control and some of which you don’t.</p><p>So: you know the things in your past, so there is no need for probability there. You don’t know the things in your future, or the things that are space-like separated from you. (Maybe. I’m not sure that talking about knowing things you control is not just a type error.) You may have cached that you should use Bayesian probability to deal with things you are uncertain about. You may have justified this by the fact that if you don’t use Bayesian probability, there is a Pareto improvement that will cause you to predict better in all worlds. The problem is that the standard justifications of Bayesian probability assume a framework in which the facts you are uncertain about are not in any way affected by whether or not you believe them! Therefore, our reasons for liking Bayesian probability do not apply to our uncertainty about the things that are in our future! Note that many things in our future (like our future observations) are also in the future of things that are space-like separated from us, so we still want to use Bayes to reason about those things in order to have better beliefs about our observations.</p><p>I claim that logical inductors do not feel entirely Bayesian, and this might be why. They can’t if they are able to think about sentences like “You assign probability less than <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> to this sentence.”</p>Scott Garrabrant, Tue, 10 Jul 2018 23:47:49 +0000

Generalizing Foundations of Decision Theory by abramdemski
https://www.greaterwrong.com/posts/5bd75cc58225bf0670375373/generalizing-foundations-of-decision-theory
<body><p>This post is more about articulating motivations than about presenting anything new, but I think readers may learn something about the foundations of classical (evidential) decision theory as they stand.</p>
<h2>The Project</h2>
<p>Most people interested in decision theory know about the VNM theorem and the Dutch Book argument, and not much more. The VNM theorem shows that <em>if</em> we have to make decisions over gambles which follow the laws of probability, <em>and</em> our preferences obey four plausible postulates of rationality (the VNM axioms), <em>then</em> our preferences over gambles can be represented as an expected utility function. On the other hand, the Dutch Book argument <em>assumes</em> that we make decisions by expected utility, but perhaps with a non-probabilistic belief function. It then proves that any violation of probability theory implies a willingness to take sure-loss gambles. (Reverse Dutch Book arguments show that indeed, following the laws of probability eliminates these sure-loss bets.)</p>
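<p>The Dutch Book argument can be made concrete with a toy sketch (the function and variable names here are invented for illustration, not from any standard library): a bettor who prices a $1 bet on any event at their own credence can be given a guaranteed-loss book exactly when their credences in an event and its complement fail to sum to 1.</p>

```python
def dutch_book(cred_E, cred_not_E):
    """Guaranteed net payoff to a bettor who prices a $1 bet on any event
    at their credence, when a bookie trades bets on E and not-E at those
    prices. Exactly one of the two bets pays $1, whatever happens."""
    total = cred_E + cred_not_E
    if total > 1:
        # Bookie sells the bettor both bets: the bettor pays `total`
        # up front and collects exactly $1 back. Sure loss.
        return 1 - total
    if total < 1:
        # Bookie buys both bets from the bettor: the bettor collects
        # `total` up front but must pay out exactly $1. Sure loss.
        return total - 1
    return 0.0  # Credences obey probability: no sure-loss book exists.

print(dutch_book(0.7, 0.4))  # negative: incoherent credences, sure loss
print(dutch_book(0.5, 0.5))  # 0.0: coherent credences are immune
```

<p>The reverse direction is visible in the last branch: once credences for an event and its complement sum to 1, neither buying nor selling the pair at those prices has a guaranteed sign.</p>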
<p>So we can more or less argue for expected utility theory starting from probability theory, and argue for probability theory starting from expected utility theory; but clearly, this is not enough to provide good reason to endorse Bayesian decision theory overall. Subsequent investigations which I will summarize have attempted to address this gap.</p>
<p>But first, why care?</p>
<ul><li><p>Logical Induction can be seen as resulting from a small tweak to the Dutch Book setup, relaxing it enough that it could apply to mathematical uncertainty. Although we were initially optimistic that Logical Induction would allow significant progress in decision theory, it has proven difficult to get a satisfying logical-induction DT. Perhaps it would be useful to instead understand the argument for DT as a whole, and try to relax the foundations of DT in “the same way” we relaxed the foundations of probability theory.</p></li><li><p>It seems likely to me that such a re-examination of the foundations would <em>automatically</em> provide justification for reflectively consistent decision theories like UDT. Hopefully I can make my intuitions clear as I describe things.</p></li><li><p>Furthermore, the foundations of DT seem like they aren’t that solid. Perhaps we’ve put blinders on by not investigating these arguments for DT in full. Even without the kind of modification to the assumptions which I’m proposing, we may find significant generalizations of DT are given just by dropping unjustified axioms in the existing foundations. We can already see one such generalization, the use of infinitesimal probability, by studying the history; I’ll explain this more.</p></li></ul>
<h2>Longer History</h2>
<h3>Justifying Probability Theory</h3>
<p>Before going into the attempts to justify Bayesian decision theory <em>in its entirety</em>, it’s worth mentioning Cox’s theorem, which is another way of justifying probability alone. Unlike the Dutch Book argument, it doesn’t rely on a connection between beliefs and decisions; instead, Cox makes a series of plausible assumptions about the nature of subjective belief, and concludes that any approach must either violate those assumptions or be essentially equivalent to probability theory.</p>
<p>There has been some <a href="http://biasandbelief.pbworks.com/w/page/6537213/References%20on%20the%20Cox%20Proof">controversy about holes in Cox’s argument</a>. Like other holes in the foundations which I will discuss later, it seems one conclusion we can draw by dropping unjustified assumptions is that there is no good reason to rule out infinitesimal probabilities. I haven’t understood the issues with Cox’s theorem yet, though, so I won’t remark on this further.</p>
<p>This is an opinionated summary of the foundations of decision theory, so I’ll remark on the relative quality of the justifications provided by the Dutch Book vs Cox. The Dutch Book argument provides what could be called <em>consequentialist</em> constraints on rationality: if you don’t follow them, something bad happens. I’ll treat this as the “highest tier” of argument. Cox’s argument relies on more <em>deontological</em> constraints: if you don’t follow them, it seems intuitively as if you’ve done something wrong. I’ll take this to be the second tier of justification.</p>
<h3>Justifying Decision Theory</h3>
<p><em>VNM</em></p>
<p>Before we move on to attempts to justify decision theory in full, let’s look at the VNM axioms in a little detail.</p>
<p>The set-up is that we’ve got a set of outcomes 𝒪, and we consider lotteries over outcomes, which associate a probability <em>p</em><sub><em>i</em></sub> with each outcome (such that 0 ≤ <em>p</em><sub><em>i</em></sub> ≤ 1 and ∑<sub><em>i</em></sub> <em>p</em><sub><em>i</em></sub> = 1). We have a preference relation over lotteries, ⪯, which must obey the following properties:</p>
<ol><li><p>(Completeness.) For any two lotteries <em>A</em>, <em>B</em>, either <em>A</em> ≺ <em>B</em>, or <em>B</em> ≺ <em>A</em>, or neither, written <em>A</em> ∼ <em>B</em>. (“<em>A</em> ≺ <em>B</em> or <em>A</em> ∼ <em>B</em>” will be abbreviated as “<em>A</em> ⪯ <em>B</em>” as usual.)</p></li><li><p>(Transitivity.) If <em>A</em> ⪯ <em>B</em> and <em>B</em> ⪯ <em>C</em>, then <em>A</em> ⪯ <em>C</em>.</p></li><li><p>(Continuity.) If <em>A</em> ⪯ <em>B</em> ⪯ <em>C</em>, then there is some <em>p</em> ∈ [0, 1] such that the mixture <em>pA</em> + (1 − <em>p</em>)<em>C</em> ∼ <em>B</em>.</p></li><li><p>(Independence.) If <em>A</em> ≺ <em>B</em>, then for any lottery <em>C</em> and any <em>p</em> ∈ (0, 1], <em>pA</em> + (1 − <em>p</em>)<em>C</em> ≺ <em>pB</em> + (1 − <em>p</em>)<em>C</em>.</p></li></ol>
style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>.</p></li><li><p>(Continuity.) If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \preceq B \preceq C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>, then there exists <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="p \in [0,1]"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">∈</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">[</span></span><span class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">0</span></span><span class="mjx-mo"><span class="mjx-char 
MJXc-TeX-main-R" style="margin-top: -0.144em; padding-bottom: 0.519em;">,</span></span><span class="mjx-mn MJXc-space1"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">1</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">]</span></span></span></span></span></span> such that a gamble <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="D"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">D</span></span></span></span></span></span> assigning probability <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="p"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span></span></span></span></span> to <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span> and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="(1-p)"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">(</span></span><span class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">1</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.446em;">−</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" 
style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">)</span></span></span></span></span></span> to <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> satisfies <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B \sim D"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">D</span></span></span></span></span></span>.</p></li><li><p>(Independence.) 
If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \prec B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, then for any <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="p \in (0,1]"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">∈</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">(</span></span><span class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">0</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="margin-top: -0.144em; padding-bottom: 0.519em;">,</span></span><span class="mjx-mn MJXc-space1"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">1</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" 
style="padding-top: 0.446em; padding-bottom: 0.593em;">]</span></span></span></span></span></span>, we have <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="p A + (1-p) C \prec p B + (1-p) C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.446em;">+</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">(</span></span><span class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">1</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.446em;">−</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">)</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space2"><span 
class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.446em;">+</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">(</span></span><span class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">1</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.446em;">−</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.225em; padding-bottom: 0.446em;">p</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.446em; padding-bottom: 0.593em;">)</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>.</p></li></ol>
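<p>(A quick illustration of my own, not part of the original argument: a preference relation defined by comparing expected utilities automatically satisfies these axioms. The sketch below checks completeness, transitivity, and independence numerically for a few sample lotteries, with made-up outcome utilities.)</p>

```python
# Lotteries as dicts outcome -> probability; the preference relation is
# defined by comparing expected utilities, which satisfies the VNM axioms.
import itertools

u = {"bad": 0.0, "ok": 1.0, "good": 3.0}   # hypothetical utilities

def eu(lot):
    """Expected utility of a lottery."""
    return sum(p * u[o] for o, p in lot.items())

def mix(p, lot1, lot2):
    """The compound lottery p*lot1 + (1-p)*lot2."""
    keys = set(lot1) | set(lot2)
    return {o: p * lot1.get(o, 0.0) + (1 - p) * lot2.get(o, 0.0)
            for o in keys}

lots = [{"bad": 1.0}, {"bad": 0.5, "good": 0.5}, {"ok": 1.0}, {"good": 1.0}]

# Completeness: any two lotteries are comparable via their EU numbers.
for A, B in itertools.combinations(lots, 2):
    assert eu(A) < eu(B) or eu(B) < eu(A) or eu(A) == eu(B)

# Transitivity: eu(A) <= eu(B) <= eu(C) implies eu(A) <= eu(C).
for A, B, C in itertools.permutations(lots, 3):
    if eu(A) <= eu(B) <= eu(C):
        assert eu(A) <= eu(C)

# Independence: A < B is preserved after mixing both with any C.
A, B = {"bad": 1.0}, {"good": 1.0}
for C in lots:
    for p in (0.25, 0.75, 1.0):
        assert eu(mix(p, A, C)) < eu(mix(p, B, C))
```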
<p>Transitivity is often considered to be justified by the money-pump argument. Suppose that you violate transitivity for some <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A, B, C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="margin-top: -0.144em; padding-bottom: 0.519em;">,</span></span><span class="mjx-mi MJXc-space1"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="margin-top: -0.144em; padding-bottom: 0.519em;">,</span></span><span class="mjx-mi MJXc-space1"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>; that is, <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \preceq B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B \preceq C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" 
style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>, but <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C \prec A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span>. Then you’ll be willing to trade away <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span> for <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> and then <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> for <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span 
class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> (perhaps in exchange for a trivial amount of money). But, then, you’ll have <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>; and since <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C \prec A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span>, you’ll gladly pay (a non-trivial amount) to switch back to <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span>. I can keep sending you through this loop to get more money out of you until you’re broke.</p>
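<p>(As a side illustration of my own: the loop above is easy to simulate. Assuming a fixed fee per preferred trade and the hypothetical cycle A ≺ B ≺ C ≺ A, the agent’s wealth drains monotonically:)</p>

```python
# Money-pump sketch: an agent with cyclic preferences A < B < C < A
# pays a small fee for each trade it strictly prefers, until broke.

def money_pump(wealth, fee, rounds):
    """Trade around the preference cycle A -> B -> C -> A.

    Each trade moves to an item the agent strictly prefers, so the
    agent accepts it despite the fee; wealth drains monotonically."""
    holding = "A"
    prefers_next = {"A": "B", "B": "C", "C": "A"}  # cyclic preference
    trades = 0
    while wealth >= fee and trades < rounds:
        holding = prefers_next[holding]  # trade up to the preferred item
        wealth -= fee                    # pay for the "upgrade"
        trades += 1
    return wealth, trades

final, n = money_pump(wealth=10.0, fee=0.5, rounds=1000)
# After 20 trades of 0.5 each, the agent is broke: final == 0.0
```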
<p>The money-pump argument seems similar in nature to the Dutch Book argument: both require a slightly unnatural setup (the assumption that utility is always exchangeable with money), but both yield strong consequentialist justifications for rationality axioms. So, I place the money-pump argument (and thus transitivity) in my “first tier” along with Dutch Book.</p>
<p>Completeness is less clear. According to the <a href="https://plato.stanford.edu/entries/decision-theory/#WhaPreOvePro">SEP</a>, “most decision theorists suggest that rationality requires that preferences be coherently extendible. This means that even if your preferences are not complete, it should be possible to complete them without violating any of the conditions that are rationally required, in particular Transitivity.” So, I suggest we place this in a third tier, the so-called <em>structural</em> axioms: those which are not really justified at all, except that assuming them allows us to prove our results.</p>
<p>“Structural axioms” are a somewhat curious artefact found in almost all of the axiom-sets which we will look at. These axioms usually have something to do with requiring that the domain is rich enough for the intended proof to go through. Completeness is not usually referred to as structural, but if we agree with the quotation above, I think we have to regard it as such.</p>
<p>I take the axiom of independence to be tier two: an intuitively strong rationality principle, but not one that’s enforced by nasty things that happen if we violate it. It surprises me that I’ve only seen this kind of justification for <em>one</em> of the four VNM axioms. Actually, I suspect that independence <em>could</em> be justified in a tier-one way; it’s just that I haven’t seen it done. (Developing a framework in which an argument for independence can be made just as well as the money-pump and Dutch Book arguments is part of my goal.)</p>
<p>I think many people would put continuity at tier two, a strong intuitive principle. I don’t see why, personally. For me, it seems like an assumption which only makes sense if we already have the intuition that expected utility is going to be the right way of doing things. This puts it in tier three for me: another structural axiom. (The analogs of continuity in the rest of the decision theories I’ll mention come off as <em>very</em> structural.)</p>
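<p>(To see how structural the axiom is once expected utility is assumed, here is my own illustration: for utilities u(A) ≤ u(B) ≤ u(C), the p promised by continuity can simply be solved for, since u(B) = p·u(A) + (1−p)·u(C).)</p>

```python
# Continuity under expected utility: given u(A) <= u(B) <= u(C),
# solve u(B) = p*u(A) + (1-p)*u(C) for the mixture weight p.

def continuity_weight(uA, uB, uC):
    """Return p in [0, 1] making the A/C gamble indifferent to B."""
    if uA == uC:          # degenerate case: all three are indifferent
        return 0.0
    return (uC - uB) / (uC - uA)

p = continuity_weight(0.0, 1.0, 4.0)   # p = (4-1)/(4-0) = 0.75
# Check the indifference: p*u(A) + (1-p)*u(C) equals u(B).
assert abs(p * 0.0 + (1 - p) * 4.0 - 1.0) < 1e-9
```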
<p><em>Savage</em></p>
<p>Leonard Savage took on the task of justifying the whole of Bayesian decision theory at once, grounding subjective probability and expected utility in a single set of axioms. I won’t describe the entire framework, as it’s fairly complicated; see the <a href="https://plato.stanford.edu/entries/decision-theory/#SavThe">SEP section</a>. I will note several features of it, though:</p>
<ul><li><p>Savage makes the somewhat peculiar move of separating the objects of belief (“states”) and objects of desire (“outcomes”). How we go about separating parts of the world into one or the other seems quite unclear.</p></li><li><p>He replaces the gambles from VNM with “acts”: an act is a function from states to outcomes (he’s practically begging us to make terrible puns about his “savage acts”). Just as the VNM theorem requires us to assume that the agent has preferences over all lotteries, Savage’s theorem requires the agent to have preferences over all acts; that is, all functions from states to outcomes. Some of these may be quite absurd.</p></li><li><p>As the paper <a href="https://philpapers.org/rec/MANAR-2">Actualist Rationality</a> complains, Savage’s justification for his axioms is quite deontological; he is primarily saying that if you noticed any violation of the axioms in yourself, you would feel there’s something wrong with your thinking and you would want to correct it somehow. This doesn’t mean <em>we</em> can’t put some of his axioms in tier one; after all, he’s got a transitivity axiom like everyone else. However, on Savage’s account, it’s all what I’d call tier-two justification.</p></li><li><p>Savage certainly has what I’d call tier-three axioms, as well. The SEP article identifies P5 and P6 as such. His axiom P6 requires that there exist world-states which are sufficiently improbable so as to make even the worst possible consequences negligible. Surely it can’t be a “requirement of rationality” that the state-space be complex enough to contain negligible possibilities; this is just something he needs to prove his theorem. P6 is Savage’s analog of the continuity axiom.</p></li><li><p>Savage chooses not to define probabilities on a sigma-algebra; indeed, I have yet to see a decision theorist who prefers to use sigma-algebras. 
Similarly, he only derives finite additivity, not countable additivity; this also seems common among decision theorists.</p></li><li><p>Savage’s representation theorem shows that if his axioms are followed, there exists a unique probability distribution and a utility function which is unique up to a linear transformation, such that the preference relation on acts is also the ordering with respect to expected utility.</p></li></ul>
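<p>(A minimal sketch of my own, with hypothetical states, outcomes, and numbers, of the representation the theorem delivers: once a probability distribution over states and a utility function over outcomes are in hand, preference over acts is just comparison of expected utilities.)</p>

```python
# Savage-style setup (illustrative): an act maps states to outcomes;
# the representation theorem says a preference obeying the axioms is
# the expected-utility ordering for some P over states and u over outcomes.

P = {"rain": 0.3, "sun": 0.7}               # hypothetical subjective probability
u = {"wet": -1.0, "dry": 0.0, "tan": 2.0}   # hypothetical utilities

def eu(act):
    """Expected utility of an act (a dict state -> outcome)."""
    return sum(P[s] * u[act[s]] for s in P)

umbrella = {"rain": "dry", "sun": "dry"}
no_umbrella = {"rain": "wet", "sun": "tan"}

# EU(umbrella) = 0.0; EU(no_umbrella) = 0.3*(-1.0) + 0.7*2.0 = 1.1,
# so this agent strictly prefers going without the umbrella.
assert eu(umbrella) < eu(no_umbrella)
```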
<p><em>Jeffrey-Bolker Axioms</em></p>
<p>In contrast to Savage, Jeffrey’s decision theory makes the objects of belief and the objects of desire the same. Both belief and desire are functions of logical propositions.</p>
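<p>(To make this concrete, here is a sketch of my own, with made-up numbers: propositions are modeled as sets of worlds, and both probability and desirability are functions of propositions, with the desirability of a disjunction of incompatible propositions being a probability-weighted average of its disjuncts’ desirabilities.)</p>

```python
# Jeffrey-style desirability (sketch): propositions are sets of worlds;
# probability P and desirability V are both functions of propositions,
# and for disjoint A, B:
#   V(A or B) = (P(A)*V(A) + P(B)*V(B)) / (P(A) + P(B))

worlds = {"w1": (0.2, 5.0), "w2": (0.3, 1.0), "w3": (0.5, -2.0)}
# each world maps to (probability, utility) -- hypothetical numbers

def P(prop):
    """Probability of a proposition (a set of worlds)."""
    return sum(worlds[w][0] for w in prop)

def V(prop):
    """Desirability: probability-weighted average utility within prop."""
    return sum(worlds[w][0] * worlds[w][1] for w in prop) / P(prop)

A, B = {"w1"}, {"w2", "w3"}        # disjoint propositions
lhs = V(A | B)
rhs = (P(A) * V(A) + P(B) * V(B)) / (P(A) + P(B))
assert abs(lhs - rhs) < 1e-9       # the averaging identity holds
```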
<p>The most common axiomatization is Bolker’s. We assume that there is a Boolean field, with a preference relation <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\prec"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span></span></span></span></span>, following these axioms:</p>
<ol><li><p><span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\prec"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span></span></span></span></span> is transitive and complete. <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\prec"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mo"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span></span></span></span></span> is defined on all elements of the field except <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span>. (Jeffrey does not wish to require preferences over propositions which the agent believes to be impossible, in contrast to Savage.)</p></li><li><p>The Boolean field is complete and atomless. More specifically:
<ul><li><p>An <em>upper bound</em> of a (possibly infinite) set of propositions is a proposition implied by every proposition in that set. The <em>supremum</em> of the set is an upper bound which implies every upper bound. Define lower bound and infimum analogously. A <em>complete</em> Boolean algebra is one in which every set of propositions has a supremum and an infimum.</p></li><li><p>An <em>atom</em> is a proposition other than <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span> which is implied by itself and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span>, but by no other propositions. An atomless Boolean algebra has no atoms.</p></li></ul>
</p></li><li><p>(Law of Averaging.) If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \wedge B = \bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∧</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">=</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span>,
<ul><li><p>If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \prec B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, then <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \prec A \vee B \prec B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span></p></li><li><p>If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \sim B"><span class="mjx-mrow" aria-hidden="true"><span 
class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, then <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \sim A \vee B \sim B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span></p></li></ul>
</p></li><li><p>(Impartiality.) If <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \wedge B = \bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∧</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">=</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span> and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \sim B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, then if <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \vee C \sim B \vee C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span 
class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> for some <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> where <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="AC = BC = \bot"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">=</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 
0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">=</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.372em;">⊥</span></span></span></span></span></span> and not <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C \sim A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span>, then <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \vee C \sim B \vee C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.077em; padding-bottom: 0.298em;">∼</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 
0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> for every such <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>.</p></li><li><p>(Continuity.) Suppose that <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="X"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em; padding-right: 0.024em;">X</span></span></span></span></span></span> is the supremum (infimum) of a set of propositions <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="\mathcal{S}"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-texatom"><span class="mjx-mrow"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-cal-R" style="padding-top: 0.446em; padding-bottom: 0.372em; padding-right: 0.036em;">S</span></span></span></span></span></span></span></span>, and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \prec X \prec B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em; padding-right: 
0.024em;">X</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>. Then there exists <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C \in \mathcal{S}"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">∈</span></span><span class="mjx-texatom MJXc-space3"><span class="mjx-mrow"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-cal-R" style="padding-top: 0.446em; padding-bottom: 0.372em; padding-right: 0.036em;">S</span></span></span></span></span></span></span></span> such that if <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="D \in \mathcal{S}"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">D</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">∈</span></span><span class="mjx-texatom MJXc-space3"><span class="mjx-mrow"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-cal-R" style="padding-top: 0.446em; padding-bottom: 0.372em; padding-right: 0.036em;">S</span></span></span></span></span></span></span></span> is implied by <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char 
MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span> (or where <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="X"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em; padding-right: 0.024em;">X</span></span></span></span></span></span> is the infimum, implies <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="C"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em; padding-right: 0.045em;">C</span></span></span></span></span></span>), then <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \prec D \prec B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">D</span></span><span class="mjx-mo MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.225em; padding-bottom: 0.372em;">≺</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>.</p></li></ol>
<p>The central axiom to Jeffrey’s decision theory is the law of averaging. This can be seen as a kind of consequentialism. If I violate this axiom, I would either value some gamble <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \vee B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> less than both its possible outcomes <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span> and <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, or value it more. 
In the first case, we could charge an agent for switching from the gamble <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \vee B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> to <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span>; this would worsen the agent’s situation, since one of <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span> or <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> was true already, <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \preceq B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo 
MJXc-space3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.372em; padding-bottom: 0.446em;">⪯</span></span><span class="mjx-mi MJXc-space3"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, and the agent has just lost money. In the other case, we can set up a proper money pump: charge the agent to keep switching to the gamble <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A \vee B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span><span class="mjx-mo MJXc-space2"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.298em; padding-bottom: 0.372em;">∨</span></span><span class="mjx-mi MJXc-space2"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span>, which it will happily do whichever of <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="A"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.519em; padding-bottom: 0.298em;">A</span></span></span></span></span></span> or <span class="mathjax-inline-container mjpage"><span class="mjx-chtml"><span class="mjx-math" aria-label="B"><span class="mjx-mrow" aria-hidden="true"><span class="mjx-mi"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.446em; padding-bottom: 0.298em;">B</span></span></span></span></span></span> come out true.</p>
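<p>The second money pump is easy enough to mechanize. Here is a minimal sketch (the utility numbers and the fee are invented for illustration; any utilities with u(gamble) greater than both outcomes work the same way):</p>

```python
import random

random.seed(1)  # reproducible; the total fees paid don't depend on the draws

# A toy agent that violates the averaging axiom in the second way: it
# values the gamble "A or B" MORE than both of its possible outcomes.
u = {"A": 1.0, "B": 2.0, "A or B": 3.0}

wealth = 0.0
holding = "A"   # the agent starts out holding the sure outcome A
fee = 0.5       # what we charge for each trade up to the gamble

for _ in range(10):
    # The agent happily pays the fee to swap its sure outcome for the
    # gamble, since the gamble's utility exceeds either outcome's.
    if u["A or B"] - fee > u[holding]:
        wealth -= fee
        holding = "A or B"
    # The gamble then resolves to one of its outcomes...
    holding = random.choice(["A", "B"])
    # ...and the agent is once again willing to pay for the gamble.

print(wealth)  # -5.0: the agent bleeds money every round
```

<p>Whichever of A or B comes out true, the agent's preferences rank the gamble above it, so the pump runs forever.</p>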
<p>So, I tentatively put axiom 3 in my first tier (pending better formalization of that argument).</p>
<p>I’ve already dealt with axiom 1, since it’s just the first two axioms of VNM rolled into one: I count transitivity as tier one, and completeness as tier two.</p>
<p>Axioms 2 and 5 are clearly structural, so I place them in my third tier. Bolker is essentially setting things up so that there will be an isomorphism to the real numbers when he derives the existence of a probability and utility distribution from the axioms.</p>
<p>Axiom 4 has to be considered structural in the sense I’m using here, as well. Jeffrey admits that there is no intuitive motivation for it unless you already think of propositions as having some kind of measure which determines their relative contribution to expected utility. If you do have such an intuition, axiom 4 is just saying that propositions whose weight is equal in one context must have equal weight in all contexts. (Savage needs a similar axiom which says that probabilities do not change in different contexts.)</p>
<p>Unlike Savage’s, Bolker’s representation theorem does not give us a unique probability distribution. Instead, we can trade between utility and probability via a certain formula. Probability zero events are not distinguishable from events which cause the utilities of all sub-events to be constant.</p>
<p><em>Jeffrey-Domotor Axioms</em></p>
<p>Zoltan Domotor provides <a href="https://philpapers.org/rec/DOMAOJ">an alternative set of axioms</a> for Jeffrey’s decision theory. Domotor points out that Bolker’s axioms are sufficient, but not necessary, for Bolker’s representation theorem, and he sets out to construct a necessary and sufficient axiomatization. This necessitates dealing with finite and incomplete Boolean fields. The result is a representation theorem which allows nonstandard reals; we can have infinitesimal probabilities, and infinitesimal or infinite utilities. So, we have a second point of evidence in favor of infinitesimal probability.</p>
<p>Although looking for necessary <em>and</em> sufficient conditions seems promising as a way of eliminating structural assumptions like completeness and atomlessness, it ends up making <em>all</em> axioms structural. In fact, Domotor gives essentially one significant axiom: his axiom J2. J2 is totally inscrutable without a careful reading of the notation introduced in his paper; it would be pointless to reproduce it here. The axiom is chosen to exactly state the conditions for the existence of a probability and utility function, and can’t be justified in any other way—at least not without providing a full justification for Jeffrey’s decision theory by other means!</p>
<p>Another consequence of Domotor’s axiomatization is that the representation becomes <em>wildly</em> non-unique. This has to be true for a representation theorem dealing with finite situations, since there is a lot of wiggle room in which probability and utility functions can represent a given preference ordering over a finite domain. It gets even worse with the addition of infinitesimals, though; the choice of nonstandard-real field confronts us as well.</p>
<h1>Conditional Probability as Primitive</h1>
<p><em>Hajek</em></p>
<p>In <a href="http://philrsss.anu.edu.au/people-defaults/alanh/papers/what_cp_couldnt_be.pdf">What Conditional Probabilities Could Not Be</a>, Alan Hajek argues that conditional probability cannot possibly be defined by Bayes’ famous formula, due primarily to its inadequacy when conditioning on events of probability zero. He also takes issue with other proposed definitions, arguing that conditional probability should instead be taken as primitive.</p>
<p>The most popular way of doing this is via Popper’s axioms of conditional probability. In <em>Learning the Impossible</em> (Vann McGee, 1994), it’s shown that conditional probability functions following Popper’s axioms and nonstandard-real probability functions with conditionals defined according to Bayes’ theorem are inter-translatable. Hajek doesn’t like the infinitesimal approach because of the resulting non-uniqueness of representation; but, for those who don’t see this as a problem but who put some stock in Hajek’s other arguments, this would be another point in favor of infinitesimal probability.</p>
<p><em>Richard Bradley</em></p>
<p>In <a href="https://philpapers.org/rec/BRAAUB">A unified Bayesian decision theory</a>, Richard Bradley shows that Savage’s and Jeffrey’s decision theories can be seen as special cases of a more general decision theory which takes conditional probabilities as a basic element. Bradley’s theory groups all the “structural” assumptions together, as axioms which postulate a rich set of “neutral” propositions (essentially, postulating a sufficiently rich set of coin-flips to measure the probabilities of other propositions against). He needs to specifically make an archimedean assumption to rule out nonstandard numbers, which could easily be dropped. He manages to derive a unique probability distribution in his representation theorem, as well.</p>
<h2>OK, So What?</h2>
<p>In general, I have hope that most of the tier-two axioms could become tier-one; that is, it seems possible to create a generalization of Dutch-book/money-pump arguments which covers most of what decision theorists consider to be principles of rationality. I have an incomplete attempt which I’ll develop in a future post. I don’t expect tier-three axioms to be justifiable in this way.</p>
<p>With such a formalism in hand, the next step would be to try to derive a representation theorem: how can we understand the preferences of an agent which doesn’t fall into these generalized traps? I’m not sure what generalizations to expect beyond infinitesimal probability. It’s not even clear that such an agent’s preferences will always be representable as a probability function and utility function pair; some more complicated structure may be implicated (in which case it will likely be difficult to find!). This would tell us something new about what agents look like in general.</p>
<p>The generalized Dutch book would likely disallow preference functions which put agents in situations they’ll predictably regret. This sounds like a temporal consistency constraint; so, it might also justify updatelessness, either automatically or with a little modification. That would certainly be interesting.</p>
<p>And, as I said before, if we have this kind of foundation we can attempt to “do the same thing we did with logical induction” to get a decision theory which is appropriate for situations of logical uncertainty as well.</p>
</body>
abramdemski — Sat, 04 Mar 2017 16:46:11 +0000
How to Measure Anything by lukeprog
https://www.greaterwrong.com/posts/ybYBCK9D7MZCcdArB/how-to-measure-anything
<p><a href="http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/0470539399/"><div class="imgonly"><img loading="lazy" src="http://commonsenseatheism.com/wp-content/uploads/2013/08/how-to-measure-anything.jpeg" alt="" align="right"></div></a>Douglas Hubbard’s <em><a href="http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/0470539399/">How to Measure Anything</a></em> is one of my favorite how-to books. I hope this summary inspires you to buy the book; it’s worth it.</p>
<p>The book opens:</p>
<blockquote>
<p>Anything can be measured. If a thing can be observed in any way at all, it lends itself to some type of measurement method. No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before. And those very things most likely to be seen as immeasurable are, virtually always, solved by relatively simple measurement methods.</p>
</blockquote>
<p>The sciences have many established measurement methods, so Hubbard’s book focuses on the measurement of “business intangibles” that are important for decision-making but tricky to measure: things like management effectiveness, the “flexibility” to create new products, the risk of bankruptcy, and public image.</p>
<h3 id="basicideas">Basic Ideas</h3>
<p>A <em>measurement</em> is an observation that quantitatively reduces uncertainty. Measurements might not yield precise, certain judgments, but they <em>do</em> reduce your uncertainty.</p>
<p>To be measured, the <em>object of measurement</em> must be described clearly, in terms of observables. A good way to clarify a vague object of measurement like “IT security” is to ask “What is IT security, and why do you care?” Such probing can reveal that “IT security” means things like a reduction in unauthorized intrusions and malware attacks, which the IT department cares about because these things result in lost productivity, fraud losses, and legal liabilities.</p>
<p><em>Uncertainty</em> is the lack of certainty: the true outcome/state/value is not known.</p>
<p><em>Risk</em> is a state of uncertainty in which some of the possibilities involve a loss.</p>
<p>Much pessimism about measurement comes from a lack of experience making measurements. Hubbard, who is <em>far</em> more experienced with measurement than his readers, says:</p>
<ol><li><p>Your problem is not as unique as you think.</p></li><li><p>You have more data than you think.</p></li><li><p>You need less data than you think.</p></li><li><p>An adequate amount of new data is more accessible than you think.</p></li></ol>
<h3>Applied Information Economics</h3>
<p>Hubbard calls his method “Applied Information Economics” (AIE). It consists of 5 steps:</p>
<ol><li><p>Define a decision problem and the relevant variables. (Start with the decision you need to make, then figure out which variables would make your decision easier if you had better estimates of their values.)</p></li><li><p>Determine what you know. (Quantify your uncertainty about those variables in terms of ranges and probabilities.)</p></li><li><p>Pick a variable, and compute the value of additional information for that variable. (Repeat until you find a variable with reasonably high information value. If no remaining variables have enough information value to justify the cost of measuring them, skip to step 5.)</p></li><li><p>Apply the relevant measurement instrument(s) to the high-information-value variable. (Then go back to step 3.)</p></li><li><p>Make a decision and act on it. (When you’ve done as much uncertainty reduction as is economically justified, it’s time to act!)</p></li></ol>
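<p>Step 3 is the distinctive quantitative move. As a hedged sketch of the simplest information-value calculation, here is the expected value of perfect information (EVPI) for a single go/no-go decision; all the numbers are invented for illustration:</p>

```python
import random

random.seed(0)

# A go/no-go decision: act if the uncertain benefit is expected to exceed
# the committed cost. Prior belief about the benefit is a normal
# distribution (assumed mean and spread, purely illustrative).
cost = 400_000
samples = [random.gauss(600_000, 150_000) for _ in range(100_000)]

# With current information the best choice is to act (mean benefit > cost).
# Perfect information would let us decline exactly when benefit < cost, so
# EVPI is the expected loss we would thereby avoid:
evpi = sum(max(cost - s, 0) for s in samples) / len(samples)
print(f"EVPI ≈ ${evpi:,.0f}")
# If measuring the benefit would cost less than the EVPI, the
# measurement is economically justified (step 4); otherwise skip to step 5.
```
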
<p>These steps are elaborated below.</p>
<h3>Step 1: Define a decision problem and the relevant variables</h3>
<p>Hubbard illustrates this step by telling the story of how he helped the Department of Veterans Affairs (VA) with a measurement problem.</p>
<p>The VA was considering seven proposed IT security projects. They wanted to know “which… of the proposed investments were justified and, after they were implemented, whether improvements in security justified further investment…” Hubbard asked his standard questions: “What do you mean by ‘IT security’? Why does it matter to you? What are you observing when you observe improved IT security?”</p>
<p>It became clear that <em>nobody</em> at the VA had thought about the details of what “IT security” meant to them. But Hubbard’s probing revealed that by “IT security” they meant a reduction in the frequency and severity of some undesirable events: agency-wide virus attacks, unauthorized system access (external or internal), unauthorized physical access, and disasters affecting the IT infrastructure (fire, flood, etc.). And each undesirable event was on the list because of specific costs associated with it: productivity losses from virus attacks, legal liability from unauthorized system access, and so on.</p>
<p>Now that the VA knew what they meant by “IT security,” they could measure specific variables, such as the number of virus attacks per year.</p>
<h3>Step 2: Determine what you know</h3>
<h4 id="uncertaintyandcalibration">Uncertainty and calibration</h4>
<p>The next step is to determine your level of uncertainty about the variables you want to measure. To do this, you can express a “confidence interval” (CI). A 90% CI is a range of values that is 90% likely to contain the correct value. For example, the security experts at the VA were 90% confident that each agency-wide virus attack would affect between 25,000 and 65,000 people.</p>
<p>Unfortunately, few people are well-calibrated estimators. For example, in some studies the true value lay in subjects’ 90% CIs only 50% of the time! These subjects were overconfident. For a well-calibrated estimator, the true value will lie in her 90% CI roughly 90% of the time.</p>
<p>Luckily, “assessing uncertainty is a general skill that can be taught with a measurable improvement.”</p>
<p>Hubbard uses several methods to calibrate each client’s value estimators, for example the security experts at the VA who needed to estimate the frequency of security breaches and their likely costs.</p>
<p>His first technique is the <em>equivalent bet test</em>. Suppose you’re asked to give a 90% CI for the year in which Newton published the universal laws of gravitation, and you can win $1,000 in one of two ways:</p>
<ol><li><p>You win $1,000 if the true year of publication falls within your 90% CI. Otherwise, you win nothing.</p></li><li><p>You spin a dial divided into two “pie slices,” one covering 10% of the dial, and the other covering 90%. If the dial lands on the small slice, you win nothing. If it lands on the big slice, you win $1,000.</p></li></ol>
<p>If you find yourself preferring option #2, then you must think spinning the dial has a higher chance of winning you $1,000 than option #1. That suggests your stated 90% CI isn’t really your 90% CI. Maybe it’s your 65% CI or your 80% CI instead. By preferring option #2, your brain is trying to tell you that your originally stated 90% CI is overconfident.</p>
<p>If instead you find yourself preferring option #1, then you must think there is <em>more</em> than a 90% chance your stated 90% CI contains the true value. By preferring option #1, your brain is trying to tell you that your original 90% CI is underconfident.</p>
<p>To make a better estimate, adjust your 90% CI until option #1 and option #2 seem equally good to you. Research suggests that even <em>pretending</em> to bet money in this way will improve your calibration.</p>
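<p>The decision logic of the equivalent bet test can be stated in a few lines (a sketch of the reasoning, not anything from the book; <code>p_ci</code> is the true, unstated probability you attach to your interval):</p>

```python
# Which $1,000 bet has the higher chance of paying out? Your preference
# reveals how confident you really are in your stated 90% CI.
def preferred_bet(p_ci, p_dial=0.90):
    if p_ci > p_dial:
        return "interval"      # option #1: you are underconfident
    if p_ci < p_dial:
        return "dial"          # option #2: you are overconfident
    return "indifferent"       # well calibrated: it really is a 90% CI

# An estimator whose "90% CI" is really a 65% CI prefers the dial:
print(preferred_bet(0.65))  # dial
```
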
<p>Hubbard’s second method for improving calibration is simply <em>repetition and feedback</em>. Make lots of estimates and then see how well you did. For this, play CFAR’s <a href="http://acritch.com/credence-game/">Calibration Game</a>.</p>
<p>Hubbard also asks people to identify reasons why a particular estimate might be right, and why it might be wrong.</p>
<p>He also asks people to look more closely at each bound (upper and lower) on their estimated range. A 90% CI “means there is a 5% chance the true value could be greater than the upper bound, and a 5% chance it could be less than the lower bound. This means the estimators must be 95% sure that the true value is less than the upper bound. If they are not that certain, they should increase the upper bound… A similar test is applied to the lower bound.”</p>
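<p>The repetition-and-feedback exercise amounts to scoring a batch of stated 90% CIs against the true answers. A minimal version (the sample estimates below are invented for illustration; only the Newton date is real):</p>

```python
estimates = [
    # (lower bound, upper bound, true value)
    (1600, 1700, 1687),   # year Newton published universal gravitation
    (100, 300, 432),
    (10, 50, 37),
    (5, 25, 8),
    (1000, 3000, 5500),
]

# Count how many stated 90% CIs actually contained the truth.
hits = sum(1 for lo, hi, truth in estimates if lo <= truth <= hi)
hit_rate = hits / len(estimates)
print(f"{hits}/{len(estimates)} intervals contained the truth ({hit_rate:.0%})")
# A calibrated 90% estimator should score near 90% over many trials;
# well below that means overconfidence, well above means underconfidence.
```
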
<h4>Simulations</h4>
<p>Once you determine what you know about the uncertainties involved, how can you use that information to determine what you know about the <em>risks</em> involved? Hubbard summarizes:</p>
<blockquote>
<p>…all risk in any project… can be expressed by one method: the ranges of uncertainty on the costs and benefits, and probabilities on events that might affect them.</p>
</blockquote>
<p>The simplest tool for measuring such risks accurately is the Monte Carlo (MC) simulation, which can be run by Excel and many other programs. To illustrate this tool, suppose you are wondering whether to lease a new machine for one step in your manufacturing process.</p>
<blockquote>
<p>The one-year lease [for the machine] is $400,000 with no option for early cancellation. So if you aren’t breaking even, you are still stuck with it for the rest of the year. You are considering signing the contract because you think the more advanced device will save some labor and raw materials and because you think the maintenance cost will be lower than the existing process.</p>
</blockquote>
<p>Your pre-calibrated estimators give their 90% CIs for the following variables:</p>
<ul><li><p>Maintenance savings (MS): $10 to $20 per unit</p></li><li><p>Labor savings (LS): -$2 to $8 per unit</p></li><li><p>Raw materials savings (RMS): $3 to $9 per unit</p></li><li><p>Production level (PL): 15,000 to 35,000 units per year</p></li></ul>
<p>Thus, your annual savings will equal (MS + LS + RMS) × PL.</p>
<p>When measuring risk, we don’t just want to know the “average” risk or benefit. We want to know the probability of a huge loss, the probability of a small loss, the probability of a huge savings, and so on. That’s what Monte Carlo can tell us.</p>
<p>An MC simulation uses a computer to randomly generate thousands of possible values for each variable, based on the ranges we’ve estimated. The computer then calculates the outcome (in this case, the annual savings) for each generated combination of values, and we’re able to see how often different kinds of outcomes occur.</p>
<p>To run an MC simulation we need not just the 90% CI for each variable but also the <em>shape</em> of each distribution. In many cases, the <a href="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a> will work just fine, and we’ll use it for all the variables in this simplified illustration. (Hubbard’s book shows you how to work with other distributions).</p>
<p>To make an MC simulation of a normally distributed variable in Excel, we use this formula:</p>
<blockquote>
<p>=norminv(rand(), mean, standard deviation)</p>
</blockquote>
<p>So the formula for the maintenance savings variable should be:</p>
<blockquote>
<p>=norminv(rand(), 15, (20-10)/3.29)</p>
</blockquote>
<p>Suppose you enter this formula in cell A1 in Excel. To generate (say) 10,000 values for the maintenance savings variable, just (1) copy the contents of cell A1, (2) enter “A1:A10000” in the cell range field to select cells A1 through A10000, and (3) paste the formula into all those cells.</p>
<p>Now we can follow this process in other columns for the other variables, including a column for the “total savings” formula. To see how many rows made a total savings of $400,000 or more (break-even), use Excel’s <a href="http://www.techonthenet.com/excel/formulas/countif.php">countif</a> function. In this case, you should find that about 14% of the scenarios resulted in a savings of less than $400,000 – a loss.</p>
<p><div class="imgonly"><img src="http://commonsenseatheism.com/wp-content/uploads/2013/08/histogram-of-MC-sim.png" alt="" align="right" loading="lazy"></div>We can also make a histogram (see right) to show how many of the 10,000 scenarios landed in each $100,000 increment (of total savings). This is even more informative, and tells us a great deal about the distribution of risk and benefits we might incur from investing in the new machine. (Download the full spreadsheet for this example <a href="http://www.hubbardresearch.com/htma-downloads/">here</a>.)</p>
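<p>For readers who prefer code to spreadsheets, the same Monte Carlo run is easy to sketch in Python. Treating all four variables as independent normals is the same simplifying assumption made above, and the 3.29 divisor appears because a 90% CI spans 3.29 standard deviations of a normal distribution:</p>

```python
import random

random.seed(1)  # fix the seed so the run is reproducible

def sample_90ci(lo, hi):
    """Draw from a normal whose 90% CI is (lo, hi); 3.29 sigmas span a 90% CI."""
    return random.gauss((lo + hi) / 2, (hi - lo) / 3.29)

trials = 10_000
losses = 0
for _ in range(trials):
    ms  = sample_90ci(10, 20)           # maintenance savings, $/unit
    ls  = sample_90ci(-2, 8)            # labor savings, $/unit
    rms = sample_90ci(3, 9)             # raw materials savings, $/unit
    pl  = sample_90ci(15_000, 35_000)   # production level, units/year
    if (ms + ls + rms) * pl < 400_000:  # below the $400,000 lease: a loss
        losses += 1

print(losses / trials)  # roughly 0.14, matching the ~14% chance of loss above
```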
<p>The simulation concept can (and in high-value cases <em>should</em>) be carried beyond this simple MC simulation. The first step is to learn how to use a greater variety of distributions in MC simulations. The second step is to deal with correlated (rather than independent) variables by generating correlated random numbers or by modeling what the variables have in common.</p>
<p>A more complicated step is to use a <a href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov simulation</a>, in which the simulated scenario is divided into many time intervals. This is often used to model stock prices, the weather, and complex manufacturing or construction projects. Another more complicated step is to use an <a href="http://en.wikipedia.org/wiki/Agent-based_model">agent-based model</a>, in which independently-acting agents are simulated. This method is often used for traffic simulations, in which each vehicle is modeled as an agent.</p>
<h3>Step 3: Pick a variable, and compute the value of additional information for that variable</h3>
<p>Information can have three kinds of value:</p>
<ol><li><p>Information can affect people’s behavior (e.g. common knowledge of germs affects sanitation behavior).</p></li><li><p>Information can have its own market value (e.g. you can sell a book with useful information).</p></li><li><p>Information can reduce uncertainty about important decisions. (This is what we’re focusing on here.)</p></li></ol>
<p>When you’re uncertain about a decision, this means there’s a chance you’ll make a non-optimal choice. The cost of a “wrong” decision is the difference between the wrong choice and the choice you would have made with perfect information. But it’s usually too costly to acquire perfect information, so instead we’d like to know which decision-relevant variables are the <em>most</em> valuable to measure more precisely, so we can decide which measurements to make.</p>
<p>Here’s a simple example:</p>
<blockquote>
<p>Suppose you could make $40 million profit if [an advertisement] works and lose $5 million (the cost of the campaign) if it fails. Then suppose your calibrated experts say they would put a 40% chance of failure on the campaign.</p>
</blockquote>
<p>The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.</p>
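<p>The arithmetic behind those two numbers is just probability-of-being-wrong times cost-of-being-wrong:</p>

```python
# Expected opportunity loss (EOL) for each choice in the campaign example.
p_fail = 0.40
campaign_cost = 5_000_000      # lost if we approve and the campaign fails
forgone_profit = 40_000_000    # lost if we reject and the campaign would have worked

eol_approve = p_fail * campaign_cost           # wrong with prob. 40%
eol_reject = (1 - p_fail) * forgone_profit     # wrong with prob. 60%
print(round(eol_approve), round(eol_reject))   # 2000000 24000000
```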
<p>The difference between EOL before and after a measurement is called the “expected value of information” (EVI).</p>
<p>In most cases, we want to compute the VoI for a range of values rather than a binary succeed/fail. So let’s tweak the advertising campaign example and say that a calibrated marketing expert’s 90% CI for sales resulting from the campaign was from 100,000 units to 1 million units. The risk is that we don’t sell enough units from this campaign to break even.</p>
<p>Suppose we profit by $25 per unit sold, so we’d have to sell at least 200,000 units from the campaign to break even (on a $5M campaign). To begin, let’s calculate the expected value of <em>perfect</em> information (EVPI), which will provide an upper bound on how much we should spend to reduce our uncertainty about how many units will be sold as a result of the campaign. Here’s how we compute it:</p>
<ol><li><p>Slice the distribution of our variable into thousands of small segments.</p></li><li><p>Compute the EOL for each segment: the opportunity loss at the segment’s midpoint times the segment’s probability.</p></li><li><p>Sum the products from step 2 across all segments.</p></li></ol>
<p>Of course, we’ll do this with a computer. For the details, see Hubbard’s book and the Value of Information spreadsheet from <a href="http://www.hubbardresearch.com/htma-downloads/">his website</a>.</p>
<p>In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.</p>
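<p>Here is a numeric sketch of that EVPI recipe. It assumes a normal distribution over units sold (a simplification; Hubbard’s spreadsheet may slice or model the distribution differently, so this lands near, not exactly on, his figure):</p>

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

lo_units, hi_units = 100_000, 1_000_000  # calibrated 90% CI for units sold
mu = (lo_units + hi_units) / 2
sigma = (hi_units - lo_units) / 3.29     # a 90% CI spans 3.29 standard deviations
profit_per_unit = 25
break_even = 200_000                     # units needed to recoup the $5M campaign

# Step 1: slice the distribution into thousands of small segments.
n_seg = 10_000
left = mu - 5 * sigma
width = 10 * sigma / n_seg
evpi = 0.0
for i in range(n_seg):
    a = left + i * width
    mid = a + width / 2
    prob = norm_cdf(a + width, mu, sigma) - norm_cdf(a, mu, sigma)
    # Step 2: opportunity loss in this segment (zero if we break even).
    loss = max(0.0, (break_even - mid) * profit_per_unit)
    evpi += prob * loss                  # Step 3: sum across segments

print(round(evpi))  # roughly $325,000 under these simplifying assumptions
```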
<p>And in fact, we should probably spend much less than $337,000, because no measurement we make will give us <em>perfect</em> information. For more details on how to measure the value of <em>imperfect</em> information, see Hubbard’s book and these three LessWrong posts: (1) <a href="https://www.greaterwrong.com/posts/xiojTDJP6FWdb2Fmb/value-of-information-8-examples">VoI: 8 Examples</a>, (2) <a href="https://www.greaterwrong.com/posts/vADtvr9iDeYsCDfxd/value-of-information-four-examples">VoI: Four Examples</a>, and (3) <a href="https://www.greaterwrong.com/posts/xDiqYyqeqPo92PojS/5-second-level-case-study-value-of-information">5-second level case study: VoI</a>.</p>
<p>I do, however, want to quote Hubbard’s comments about the “measurement inversion”:</p>
<blockquote>
<p>By 1999, I had completed the… Applied Information Economics analysis on about 20 major [IT] investments… Each of these business cases had 40 to 80 variables, such as initial development costs, adoption rate, productivity improvement, revenue growth, and so on. For each of these business cases, I ran a macro in Excel that computed the information value for each variable… [and] I began to see this pattern:</p>
<ul><li><p>The vast majority of variables had an information value of zero…</p></li><li><p>The variables that had high information values were routinely those that the client had never measured…</p></li><li><p>The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value…</p></li></ul>
<p>…since then, I’ve applied this same test to another 40 projects, and… [I’ve] noticed the same phenomena arise in projects relating to research and development, military logistics, the environment, venture capital, and facilities expansion.</p>
</blockquote>
<p>Hubbard calls this the “Measurement Inversion”:</p>
<blockquote>
<p>In a business case, the economic value of measuring a variable is usually inversely proportional to how much measurement attention it usually gets.</p>
</blockquote>
<p>Here is one example:</p>
<blockquote>
<p>A stark illustration of the Measurement Inversion for IT projects can be seen in a large UK-based insurance client of mine that was an avid user of a software complexity measurement method called “function points.” This method was popular in the 1980s and 1990s as a basis of estimating the effort for large software development efforts. This organization had done a very good job of tracking initial estimates, function point estimates, and actual effort expended for over 300 IT projects. The estimation required three or four full-time persons as “certified” function point counters…</p>
</blockquote>
<blockquote>
<p>But a very interesting pattern arose when I compared the function point estimates to the initial estimates provided by project managers… The costly, time-intensive function point counting did change the initial estimate but, on average, it was no closer to the actual project effort than the initial estimate… Not only was this the single largest measurement effort in the IT organization, it literally added <em>no</em> value since it didn’t reduce uncertainty at all. Certainly, more emphasis on measuring the benefits of the proposed projects – or almost anything else – would have been better money spent.</p>
</blockquote>
<p>Hence the importance of calculating EVI.</p>
<h3>Step 4: Apply the relevant measurement instrument(s) to the high-information-value variable</h3>
<p>If you followed the first three steps, then you’ve defined a variable you want to measure in terms of the decision it affects and how you observe it, you’ve quantified your uncertainty about it, and you’ve calculated the value of gaining additional information about it. Now it’s time to reduce your uncertainty about the variable – that is, to measure it.</p>
<p>Each scientific discipline has its own specialized measurement methods. Hubbard’s book describes measurement methods that are often useful for reducing our uncertainty about the “softer” topics often encountered by decision-makers in business.</p>
<h4>Selecting a measurement method</h4>
<p>To figure out which category of measurement methods is appropriate for a particular case, we must ask several questions:</p>
<ol><li><p>Decomposition: Which parts of the thing are we uncertain about?</p></li><li><p>Secondary research: How has the thing (or its parts) been measured by others?</p></li><li><p>Observation: How do the identified observables lend themselves to measurement?</p></li><li><p>Measure just enough: How much do we need to measure it?</p></li><li><p>Consider the error: How might our observations be misleading?</p></li></ol>
<h5>Decomposition</h5>
<p>Sometimes you’ll want to start by decomposing an uncertain variable into several parts to identify which observables you can most easily measure. For example, rather than directly estimating the cost of a large construction project, you could break it into parts and estimate the cost of each part of the project.</p>
<p>In Hubbard’s experience, decomposition itself – even without making any new measurements – often reduces one’s uncertainty about the variable of interest.</p>
<h5>Secondary research</h5>
<p>Don’t reinvent the world. In almost all cases, someone has already invented the measurement tool you need, and you just need to find it. Here are Hubbard’s tips on secondary research:</p>
<ol><li><p>If you’re new to a topic, start with Wikipedia rather than Google. Wikipedia will give you a more organized perspective on the topic at hand.</p></li><li><p>Use search terms often associated with quantitative data. E.g. don’t just search for “software quality” or “customer perception” – add terms like “table,” “survey,” “control group,” and “standard deviation.”</p></li><li><p>Think of internet research in two levels: general search engines and topic-specific repositories (e.g. the CIA World Fact Book).</p></li><li><p>Try multiple search engines.</p></li><li><p>If you find marginally related research that doesn’t directly address your topic of interest, check the bibliography for more relevant reading material.</p></li></ol>
<p>I’d also recommend my post <a href="https://www.greaterwrong.com/posts/37sHjeisS9uJufi4u/scholarship-how-to-do-it-efficiently">Scholarship: How to Do It Efficiently</a>.</p>
<h5>Observation</h5>
<p>If you’re not sure how to measure your target variable’s observables, ask these questions:</p>
<ol><li><p>Does it leave a trail? Example: longer waits on customer support lines cause customers to hang up and not call back. Maybe you can also find a correlation between customers who hang up after long waits and reduced sales to those customers.</p></li><li><p>Can you observe it directly? Maybe you haven’t been tracking how many of the customers in your parking lot show an out-of-state license, but you could start. Or at least, you can observe a sample of these data.</p></li><li><p>Can you create a way to observe it indirectly? <a href="http://Amazon.com" class="bare-url">Amazon.com</a> added a gift-wrapping feature in part so they could better track how many books were being purchased as gifts. Another example is when consumers are given coupons so that retailers can see which newspapers their customers read.</p></li><li><p>Can the thing be forced to occur under new conditions which allow you to observe it more easily? E.g. you could implement a proposed returned-items policy in some stores but not others and compare the outcomes.</p></li></ol>
<h5>Measure just enough</h5>
<p>Because initial measurements often tell you quite a lot, and also change the value of continued measurement, Hubbard often aims to spend only about 10% of the EVPI on a measurement, and sometimes as little as 2% (especially for very large projects).</p>
<h5>Consider the error</h5>
<p>It’s important to be conscious of some common ways in which measurements can mislead.</p>
<p>Scientists distinguish two types of measurement error: systemic and random. Random errors are random variations from one observation to the next. They can’t be individually predicted, but they fall into patterns that can be accounted for with the laws of probability. Systemic errors, in contrast, are consistent. For example, the sales staff may routinely overestimate the next quarter’s revenue by 50% (on average).</p>
<p>We must also distinguish precision and accuracy. A “precise” measurement tool has low random error. E.g. if a bathroom scale gives the exact same displayed weight every time we set a particular book on it, then the scale has high precision. An “accurate” measurement tool has low systemic error. The bathroom scale, while precise, might be inaccurate if the weight displayed is systemically biased in one direction – say, eight pounds too heavy. A measurement tool can also have low precision but good accuracy, if it gives inconsistent measurements but they average to the true value.</p>
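<p>A quick simulation makes the distinction concrete. The scale’s eight-pound bias comes from the example above; the specific noise levels are invented for illustration:</p>

```python
import random

random.seed(0)
true_weight = 150.0
# Precise but inaccurate: tiny random error, consistent 8-lb systemic error.
precise_biased = [true_weight + 8 + random.gauss(0, 0.1) for _ in range(1000)]
# Imprecise but accurate: large random error, no systemic error.
noisy_unbiased = [true_weight + random.gauss(0, 10) for _ in range(1000)]

mean_biased = sum(precise_biased) / len(precise_biased)
mean_unbiased = sum(noisy_unbiased) / len(noisy_unbiased)
# Averaging many readings washes out random error but cannot remove bias:
print(round(mean_biased, 1), round(mean_unbiased, 1))
```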
<p>Random error tends to be easier to handle. Consider this example:</p>
<blockquote>
<p>For example, to determine how much time sales reps spend in meetings with clients versus other administrative tasks, they might choose a complete review of all time sheets… [But] if a complete review of 5,000 time sheets… tells us that sales reps spend 34% of their time in direct communication with customers, we still don’t know how far from the truth it might be. Still, this “exact” number seems reassuring to many managers. Now, suppose a sample of direct observations of randomly chosen sales reps at random points in time finds that sales reps were in client meetings or on client phone calls only 13 out of 100 of those instances. (We can compute this without interrupting a meeting by asking as soon as the rep is available.) As we will see [later], in the latter case, we can statistically compute a 90% CI to be 7.5% to 18.5%. Even though this random sampling approach gives us only a range, we should prefer its findings to the census audit of time sheets. The census… gives us an exact number, but we have no way to know by how much and in which direction the time sheets err.</p>
</blockquote>
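<p>The 90% CI quoted in that passage is consistent with the standard normal approximation for a sampled proportion; here is a sketch of that calculation (the approximation is my reading of how the interval was computed, not a method Hubbard names here):</p>

```python
from math import sqrt

k, n = 13, 100                 # 13 of 100 random spot checks found client contact
p = k / n
z90 = 1.645                    # z-score enclosing the middle 90% of a normal
se = sqrt(p * (1 - p) / n)     # standard error of the sampled proportion
lo, hi = p - z90 * se, p + z90 * se
print(f"{lo:.3f} to {hi:.3f}") # ~0.075 to 0.185, matching the quoted 7.5% to 18.5%
```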
<p>Systemic error is also called a “bias.” Based on his experience, Hubbard suspects the three most important to avoid are:</p>
<ol><li><p>Confirmation bias: people see what they want to see.</p></li><li><p>Selection bias: your sample might not be representative of the group you’re trying to measure.</p></li><li><p>Observer bias: the very act of observation can affect what you observe. E.g. in one study, researchers found that worker productivity improved no matter <em>what</em> they changed about the workplace. The workers seem to have been responding merely to the <em>fact</em> that they were being observed in <em>some</em> way.</p></li></ol>
<h5>Choose and design the measurement instrument</h5>
<p>After following the above steps, Hubbard writes, “the measurement instrument should be almost completely formed in your mind.” But if you still can’t come up with a way to measure the target variable, here are some additional tips:</p>
<ol><li><p><em>Work through the consequences</em>. If the value is surprisingly high, or surprisingly low, what would you expect to see?</p></li><li><p><em>Be iterative</em>. Start with just a few observations, and then recalculate the information value.</p></li><li><p><em>Consider multiple approaches</em>. Your first measurement tool may not work well. Try others.</p></li><li><p><em>What’s the really simple question that makes the rest of the measurement moot?</em> First see if you can detect <em>any</em> change in research quality before trying to measure it more comprehensively.</p></li></ol>
<h4>Sampling reality</h4>
<p>In most cases, we’ll estimate the values in a population by measuring the values in a small sample from that population. And for reasons discussed in chapter 7, a very small sample can often offer large reductions in uncertainty.</p>
<p>There are a variety of tools we can use to build our estimates from small samples, and which one we should use often depends on how outliers are distributed in the population. In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all. Here are some examples:</p>
<ul><li><p>Very quick convergence, only 1–2 samples needed: cholesterol level of your blood, purity of public water supply, weight of jelly beans.</p></li><li><p>Usually quick convergence, 5–30 samples needed: Percentage of customers who like the new product, failure loads of bricks, age of your customers, how many movies people see in a year.</p></li><li><p>Potentially slow convergence: Software project cost overruns, factory downtime due to an accident.</p></li><li><p>Maybe non-convergent: Market value of corporations, individual levels of income, casualties of wars, size of volcanic eruptions.</p></li></ul>
<p>Below, I survey just a few of the many sampling methods Hubbard covers in his book.</p>
<h5>Mathless estimation</h5>
<p>When working with a quickly converging phenomenon and a symmetric distribution (uniform, normal, camel-back, or bow-tie) for the population, you can use the <a href="http://en.wikipedia.org/wiki/T-statistic">t-statistic</a> to develop a 90% CI even when working with very small samples. (See the book for instructions.)</p>
<p>Or, even easier, make use of the <em>Rule of Five</em>: “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.”</p>
<p>The Rule of Five has another advantage over the t-statistic: it works for any distribution of values in the population, including ones with slow convergence or no convergence at all! It can do this because it gives us a confidence interval for the <em>median</em> rather than the <em>mean</em>, and it’s the mean that is far more affected by outliers.</p>
<p>Hubbard calls this a “mathless” estimation technique because it doesn’t require us to take square roots or calculate standard deviation or anything like that. Moreover, this mathless technique extends beyond the Rule of Five: If we sample 8 items, there is a 99.2% chance that the median of the population falls between the largest and smallest values. If we take the <em>2nd</em> largest and 2nd smallest values (out of 8 total), we get something close to a 90% CI for the median. Hubbard generalizes the tool with this handy reference table:</p>
<p class="imgonly"><img src="http://commonsenseatheism.com/wp-content/uploads/2013/08/mathless-90-percent-CI-for-median.jpg" alt="" align="center" loading="lazy"></p>
<p>And if the distribution is symmetrical, then the mathless table gives us a 90% CI for the mean as well as for the median.</p>
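<p>These coverage probabilities follow from simple order statistics: each sample falls below the median with probability <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> regardless of the population’s distribution, so the chance that an interval of sample values straddles the median is a fair-coin binomial calculation. A sketch:</p>

```python
from math import comb

def median_ci_conf(n, j):
    """P that the population median lies between the j-th smallest and j-th
    largest of n random samples, for any continuous distribution."""
    # The interval covers the median iff between j and n-j samples fall below it.
    return sum(comb(n, k) for k in range(j, n - j + 1)) / 2**n

print(median_ci_conf(5, 1))  # Rule of Five: 0.9375
print(median_ci_conf(8, 1))  # 8 samples, min to max: ~0.992
print(median_ci_conf(8, 2))  # 2nd smallest to 2nd largest: ~0.93, near a 90% CI
```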
<h5>Catch-recatch</h5>
<p>How does a biologist measure the number of fish in a lake? She catches and tags a sample of fish – say, 1000 of them – and then releases them. After the fish have had time to spread amongst the rest of the population, she’ll catch another sample of fish. Suppose she caught 1000 fish again, and 50 of them were tagged. This would mean 5% of the fish were tagged, and thus that there were about 20,000 fish in the entire lake. (See Hubbard’s book for the details on how to calculate the 90% CI.)</p>
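<p>The point estimate here is the classic Lincoln–Petersen mark-and-recapture estimator, which is one line of arithmetic:</p>

```python
# Lincoln-Petersen estimate of population size from mark and recapture.
tagged = 1000           # fish tagged and released in the first catch
recaught = 1000         # size of the second catch
tagged_in_recatch = 50  # tagged fish found in the second catch

# If 5% of the second catch is tagged, the 1,000 tagged fish
# should make up about 5% of the whole lake.
population = tagged * recaught / tagged_in_recatch
print(population)  # 20000.0
```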
<h5>Spot sampling</h5>
<p>The fish example was a special case of a common problem: population proportion sampling. Often, we want to know what proportion of a population has a particular trait. How many registered voters in California are Democrats? What percentage of your customers prefer a new product design over the old one?</p>
<p>Hubbard’s book discusses how to solve the general problem, but for now let’s just consider another special case: spot sampling.</p>
<p>In spot sampling, you take random snapshots of things rather than tracking them constantly. What proportion of their work hours do employees spend on Facebook? To answer this, you “randomly sample people through the day to see what they were doing <em>at that moment</em>. If you find that in 12 instances out of 100 random samples” employees were on Facebook, you can guess they spend about 12% of their time on Facebook (the 90% CI is 8% to 18%).</p>
<h5>Clustered sampling</h5>
<p>Hubbard writes:</p>
<blockquote>
<p>“Clustered sampling” is defined as taking a random sample of groups, then conducting a census or a more concentrated sampling within the group. For example, if you want to see what share of households has satellite dishes… it might be cost effective to randomly choose several city blocks, then conduct a complete census of everything in a block. (Zigzagging across town to individually selected households would be time consuming.) In such cases, we can’t really consider the number of [households] in the groups… to be the number of random samples. Within a block, households may be very similar… [and therefore] it might be necessary to treat the effective number of random samples as the number of blocks…</p>
</blockquote>
<h5>Measure to the threshold</h5>
<p>For many decisions, one course of action is called for if a value is above some threshold, and another if it is below. For such decisions, you don’t care as much about a measurement that reduces uncertainty in general as you do about a measurement that tells you which decision to make based on the threshold. Hubbard gives an example:</p>
<blockquote>
<p>Suppose you needed to measure the average amount of time spent by employees in meetings that could be conducted remotely… If a meeting is among staff members who communicate regularly and for a relatively routine topic, but someone has to travel to make the meeting, you probably can conduct it remotely. You start out with your calibrated estimate that the median employee spends between 3% to 15% traveling to meetings that could be conducted remotely. You determine that if this percentage is actually over 7%, you should make a significant investment in tele meetings. The [EVPI] calculation shows that it is worth no more than $15,000 to study this. According to our rule of thumb for measurement costs, we might try to spend about $1,500…</p>
</blockquote>
<blockquote>
<p>Let’s say you sampled 10 employees and… you find that only 1 spends less time in these activities than the 7% threshold. Given this information, what is the chance that the median time spent in such activities is actually below 7%, in which case the investment would not be justified? One “common sense” answer is <span class="frac"><sup>1</sup>⁄<sub>10</sub></span>, or 10%. Actually… the real chance is much smaller.</p>
</blockquote>
<p>Hubbard shows how to derive the real chance in his book. The key point is that “the uncertainty about the threshold can fall much faster than the uncertainty about the quantity in general.”</p>
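<p>Hubbard’s own derivation is in the book, but a crude Bayesian sketch shows why the chance is far below 10%. Assume (my simplification, not necessarily Hubbard’s setup) a uniform prior over the fraction of employees below the 7% threshold; the median is below the threshold exactly when that fraction exceeds one half:</p>

```python
from math import comb

n, k = 10, 1  # 1 of 10 sampled employees was below the 7% threshold

# Grid over q = fraction of employees below the threshold, uniform prior,
# binomial likelihood for the observed sample.
grid = [i / 10_000 for i in range(1, 10_000)]
like = [comb(n, k) * q**k * (1 - q)**(n - k) for q in grid]
total = sum(like)

# Posterior probability that the median is below 7% (i.e. q > 1/2).
posterior_median_below = sum(l for q, l in zip(grid, like) if q > 0.5) / total
print(round(posterior_median_below, 4))  # ~0.006, far below the "common sense" 10%
```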
<h5>Regression modeling</h5>
<p>What if you want to figure out the cause of something that has many possible causes? One method is to perform a <em>controlled experiment</em>, and compare the outcomes of a test group to a control group. Hubbard discusses this in his book (and yes, he’s a Bayesian, and a skeptic of p-value hypothesis testing). For this summary, I’ll instead mention another method for isolating causes: regression modeling. Hubbard explains:</p>
<blockquote>
<p>If we use regression modeling with historical data, we may not need to conduct a controlled experiment. Perhaps, for example, it is difficult to tie an IT project to an increase in sales, but we might have lots of data about how something <em>else</em> affects sales, such as faster time to market of new products. If we know that faster time to market is possible by automating certain tasks, that this IT investment eliminates certain tasks, and those tasks are on the critical path in the time-to-market, we can make the connection.</p>
</blockquote>
<p>Hubbard’s book explains the basics of linear regression, and of course gives the caveat that correlation does not imply causation. But, he writes, “you should conclude that one thing causes another only if you have some <em>other</em> good reason besides the correlation itself to suspect a cause-and-effect relationship.”</p>
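<p>As a minimal illustration of that machinery, a one-variable least-squares fit can be done by hand. The time-to-market framing echoes the quote above, but the numbers are invented purely for illustration:</p>

```python
# Least-squares fit of y = a + b*x, computed from the closed-form formulas.
xs = [1, 2, 3, 4, 5, 6]                # e.g. months of faster time to market (hypothetical)
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # e.g. added sales, $M (hypothetical)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope = covariance(x, y) / variance(x); intercept makes the line pass
# through the point of means.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
a = my - b * mx
print(round(b, 2), round(a, 2))  # slope ~2, small intercept
```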
<h4>Bayes</h4>
<p>Hubbard’s 10th chapter opens with a tutorial on Bayes’ Theorem. For an online tutorial, see <a href="http://yudkowsky.net/rational/bayes">here</a>.</p>
<p>Hubbard then zooms out to a big-picture view of measurement, and recommends the “instinctive Bayesian approach”:</p>
<ol><li><p>Start with your calibrated estimate.</p></li><li><p>Gather additional information (polling, reading other studies, etc.)</p></li><li><p>Update your calibrated estimate subjectively, without doing any additional math.</p></li></ol>
<p>Hubbard says a few things in support of this approach. First, he points to some studies (e.g. <a href="http://www.tandfonline.com/doi/abs/10.1080/01621459.1995.10476620">El-Gamal & Grether (1995)</a>) showing that people often reason in roughly-Bayesian ways. Next, he says that in his experience, people become better intuitive Bayesians when they (1) are made aware of the <a href="http://en.wikipedia.org/wiki/Base_rate_fallacy">base rate fallacy</a>, and when they (2) are better calibrated.</p>
<p>Hubbard says that once these conditions are met,</p>
<blockquote>
<p>[then] humans seem to be mostly logical when incorporating new information into their estimates along with the old information. This fact is extremely useful because a human can consider qualitative information that does not fit in standard statistics. For example, if you were giving a forecast for how a new policy might change “public image” – measured in part by a reduction in customer complaints, increased revenue, and the like – a calibrated expert should be able to update current knowledge with “qualitative” information about how the policy worked for other companies, feedback from focus groups, and similar details. Even with sampling information, the calibrated estimator – who has a Bayesian instinct – can consider qualitative information on samples that most textbooks don’t cover.</p>
</blockquote>
<p>He also offers a chart showing how a pure Bayesian estimator compares to other estimators:</p>
<p class="imgonly"><img src="http://commonsenseatheism.com/wp-content/uploads/2013/08/confidence-versus-information-emphasis.jpg" alt="" align="center" loading="lazy"></p>
<p>Also, Bayes’ Theorem allows us to perform a “Bayesian inversion”:</p>
<blockquote>
<p>Given a particular observation, it may seem more obvious to frame a measurement by asking the question “What can I conclude from this observation?” or, in probabilistic terms, “What is the probability X is true, given my observation?” But Bayes showed us that we could, instead, start with the question, “What is the probability of this observation if X were true?”</p>
</blockquote>
<blockquote>
<p>The second form of the question is useful because the answer is often more straightforward and it leads to the answer to the other question. It also forces us to think about the likelihood of different observations given a particular hypothesis and what that means for interpreting an observation.</p>
</blockquote>
<blockquote>
<p>[For example] if, hypothetically, we know that only 20% of the population will continue to shop at our store, then we can determine the chance [that] exactly 15 out of 20 would say so… [The details are explained in the book.] Then we can invert the problem with Bayes’ theorem to compute the chance that only 20% of the population will continue to shop there given [that] 15 out of 20 said so in a random sample. We would find that chance to be very nearly zero…</p>
</blockquote>
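<p>That inversion can be sketched directly in code: first compute P(observation | hypothesis), then turn it around with Bayes’ theorem (the uniform prior over the true proportion is my simplifying assumption):</p>

```python
from math import comb

n, k = 20, 15  # 15 of 20 sampled customers say they will continue to shop here

def likelihood(p):
    """P(exactly 15 of 20 say so | true proportion is p): binomial."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(likelihood(0.20))  # about 1.7e-07: such a sample is wildly unlikely at p = 0.2

# Invert with Bayes' theorem over a grid of candidate proportions, uniform prior.
grid = [i / 1000 for i in range(1, 1000)]
weights = [likelihood(p) for p in grid]
posterior_p_at_most_20pct = sum(w for p, w in zip(grid, weights) if p <= 0.2) / sum(weights)
print(posterior_p_at_most_20pct)  # very nearly zero, as the quote says
```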
<h4>Other methods</h4>
<p>Other chapters discuss other measurement methods, for example prediction markets, Rasch models, methods for measuring preferences and happiness, methods for improving the subjective judgments of experts, and many others. </p>
<h3>Step 5: Make a decision and act on it</h3>
<p>The last step will make more sense if we first “bring the pieces together.” Hubbard now organizes his consulting work with a firm into three phases (plus a preparatory “Phase 0”), so let’s review what we’ve learned in the context of those phases.</p>
<h4>Phase 0: Project Preparation</h4>
<ul><li><p><em>Initial research</em>: Interviews and secondary research to get familiar with the nature of the decision problem.</p></li><li><p><em>Expert identification</em>: Usually 4–5 experts who provide estimates.</p></li></ul>
<h4>Phase 1: Decision Modeling</h4>
<ul><li><p><em>Decision problem definition</em>: Experts define the problem they’re trying to analyze.</p></li><li><p><em>Decision model detail</em>: Using an Excel spreadsheet, the AIE analyst elicits from the experts all the factors that matter for the decision being analyzed: costs and benefits, ROI, etc.</p></li><li><p><em>Initial calibrated estimates</em>: First, the experts undergo calibration training. Then, they fill in the values (as 90% CIs or other probability distributions) for the variables in the decision model.</p></li></ul>
<h4>Phase 2: Optimal measurements</h4>
<ul><li><p><em>Value of information analysis</em>: Using Excel macros, the AIE analyst runs a value of information analysis on every variable in the model.</p></li><li><p><em>Preliminary measurement method designs</em>: Focusing on the few variables with highest information value, the AIE analyst chooses measurement methods that should reduce uncertainty.</p></li><li><p><em>Measurement methods</em>: Decomposition, random sampling, Bayesian inversion, controlled experiments, and other methods are used (as appropriate) to reduce the uncertainty of the high-VoI variables.</p></li><li><p><em>Updated decision model</em>: The AIE analyst updates the decision model based on the results of the measurements.</p></li><li><p><em>Final value of information analysis</em>: The AIE analyst runs a VoI analysis on each variable again. As long as this analysis shows information value much greater than the cost of measurement for some variables, measurement and VoI analysis continues in multiple iterations. Usually, though, only one or two iterations are needed before the VoI analysis shows that no further measurements are justified.</p></li></ul>
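<p>The value-of-information step can be illustrated with a toy Monte Carlo computation of the <em>expected value of perfect information</em> (EVPI), an upper bound on what any measurement of a variable could be worth. The decision, cost, and calibrated interval below are hypothetical, not Hubbard’s actual spreadsheet macros:</p>

```python
import random

random.seed(0)

# Hypothetical two-act decision: launch a project or not. The uncertain
# variable is annual benefit; a calibrated 90% CI of (hypothetically)
# $100k-$400k is modeled as a normal distribution.
COST = 250_000

def sample_benefit():
    # Normal whose 5th/95th percentiles land near the 90% CI bounds.
    mu, sigma = 250_000, 91_000   # (400k - 100k) / (2 * 1.645) ≈ 91k
    return random.gauss(mu, sigma)

N = 100_000
samples = [sample_benefit() for _ in range(N)]

# Expected payoff of each act under current uncertainty.
ev_launch = sum(b - COST for b in samples) / N
ev_skip = 0.0
ev_best_act = max(ev_launch, ev_skip)

# With perfect information we would pick the better act in each scenario.
ev_perfect = sum(max(b - COST, 0.0) for b in samples) / N

evpi = ev_perfect - ev_best_act   # ceiling on any measurement's value
print(f"EVPI ≈ ${evpi:,.0f}")
```

If a proposed measurement costs more than the EVPI, it cannot be worth doing; the iterated VoI analysis described above stops once no variable clears this bar.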
<h4>Phase 3: Decision optimization and the final recommendation</h4>
<ul><li><p><em>Completed risk/return analysis</em>: A final MC simulation shows the likelihood of possible outcomes.</p></li><li><p><em>Identified metrics procedures</em>: Procedures are put in place to measure some variables (e.g. about project progress or external factors) continually.</p></li><li><p><em>Decision optimization</em>: The final business decision recommendation is made (this is rarely a simple “yes/no” answer).</p></li></ul>
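<p>The final risk/return Monte Carlo simulation can be sketched as follows; the cost and benefit distributions here are hypothetical stand-ins for measured calibrated intervals:</p>

```python
import random

random.seed(1)

# Hypothetical completed decision model: ROI = (benefits - costs) / costs,
# with both inputs modeled as simple distributions.
def simulate_roi():
    costs = random.uniform(80_000, 120_000)
    benefits = random.gauss(130_000, 40_000)
    return (benefits - costs) / costs

N = 100_000
rois = [simulate_roi() for _ in range(N)]
p_loss = sum(r < 0 for r in rois) / N   # chance the project loses money
print(f"P(negative ROI) ≈ {p_loss:.1%}")
```

The full distribution of simulated ROIs, not just this one summary number, is what feeds the “likelihood of possible outcomes” chart shown to the decision maker.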
<h4>Final thoughts</h4>
<p>Hubbard’s book includes two case studies in which Hubbard describes how he led two fairly different clients (the EPA and U.S. Marine Corps) through each phase of the AIE process. Then, he closes the book with the following summary:</p>
<ul><li><p>If it’s really that important, it’s something you can define. If it’s something you think exists at all, it’s something you’ve already observed somehow.</p></li><li><p>If it’s something important and something uncertain, you have a cost of being wrong and a chance of being wrong.</p></li><li><p>You can quantify your current uncertainty with calibrated estimates.</p></li><li><p>You can compute the value of additional information by knowing the “threshold” of the measurement where it begins to make a difference compared to your existing uncertainty.</p></li><li><p>Once you know what it’s worth to measure something, you can put the measurement effort in context and decide on the effort it should take.</p></li><li><p>Knowing just a few methods for random sampling, controlled experiments, or even merely improving on the judgments of experts can lead to a significant reduction in uncertainty.</p></li></ul>
<p><small>lukeprog, Wed, 07 Aug 2013 04:05:58 +0000</small></p>
<h1>Decision Theory FAQ by lukeprog</h1>
https://www.greaterwrong.com/posts/zEWJBFFMvQ835nq6h/decision-theory-faq
<p><small>Co-authored with <a href="https://www.lesswrong.com/user/crazy88/overview/">crazy88</a>. Please let us know when you find mistakes, and we’ll fix them. Last updated 03-27-2013.</small></p>
<p><strong>Contents</strong>:</p>
<div id="TOC">
<ul><li><p><a href="#what-is-decision-theory">1. What is decision theory?</a></p></li><li><p><a href="#is-the-rational-decision-always-the-right-decision">2. Is the rational decision always the right decision?</a></p></li><li><p><a href="#how-can-i-better-understand-a-decision-problem">3. How can I better understand a decision problem?</a></p></li><li><p><a href="#how-can-i-measure-an-agents-preferences">4. How can I measure an agent’s preferences?</a>
<ul><li><p><a href="#the-concept-of-utility">4.1. The concept of utility</a></p></li><li><p><a href="#types-of-utility">4.2. Types of utility</a></p></li></ul>
</p></li><li><p><a href="#what-do-decision-theorists-mean-by-risk-ignorance-and-uncertainty">5. What do decision theorists mean by “risk,” “ignorance,” and “uncertainty”?</a></p></li><li><p><a href="#how-should-i-make-decisions-under-ignorance">6. How should I make decisions under ignorance?</a>
<ul><li><p><a href="#the-dominance-principle">6.1. The dominance principle</a></p></li><li><p><a href="#maximin-and-leximin">6.2. Maximin and leximin</a></p></li><li><p><a href="#maximax-and-optimism-pessimism">6.3. Maximax and optimism-pessimism</a></p></li><li><p><a href="#other-decision-principles">6.4. Other decision principles</a></p></li></ul>
</p></li><li><p><a href="#can-decisions-under-ignorance-be-transformed-into-decisions-under-uncertainty">7. Can decisions under ignorance be transformed into decisions under uncertainty?</a></p></li><li><p><a href="#how-should-i-make-decisions-under-uncertainty">8. How should I make decisions under uncertainty?</a>
<ul><li><p><a href="#the-law-of-large-numbers">8.1. The law of large numbers</a></p></li><li><p><a href="#the-axiomatic-approach">8.2. The axiomatic approach</a></p></li><li><p><a href="#the-von-neumann-morgenstern-utility-theorem">8.3. The Von Neumann-Morgenstern utility theorem</a></p></li><li><p><a href="#vnm-utility-theory-and-rationality">8.4. VNM utility theory and rationality</a></p></li><li><p><a href="#objections-to-vnm-rationality">8.5. Objections to VNM-rationality</a></p></li><li><p><a href="#should-we-accept-the-vnm-axioms">8.6. Should we accept the VNM axioms?</a></p></li></ul>
</p></li><li><p><a href="#does-axiomatic-decision-theory-offer-any-action-guidance">9. Does axiomatic decision theory offer any action guidance?</a></p></li><li><p><a href="#how-does-probability-theory-play-a-role-in-decision-theory">10. How does probability theory play a role in decision theory?</a>
<ul><li><p><a href="#the-basics-of-probability-theory">10.1. The basics of probability theory</a></p></li><li><p><a href="#bayes-theorem-for-updating-probabilities">10.2. Bayes theorem for updating probabilities</a></p></li><li><p><a href="#how-should-probabilities-be-interpreted">10.3. How should probabilities be interpreted?</a></p></li></ul>
</p></li><li><p><a href="#what-about-newcombs-problem-and-alternative-decision-algorithms">11. What about “Newcomb’s problem” and alternative decision algorithms?</a>
<ul><li><p><a href="#newcomblike-problems-and-two-decision-algorithms">11.1. Newcomblike problems and two decision algorithms</a></p></li><li><p><a href="#benchmark-theory-bt">11.2. Benchmark theory (BT)</a></p></li><li><p><a href="#timeless-decision-theory-tdt">11.3. Timeless decision theory (TDT)</a></p></li><li><p><a href="#decision-theory-and-winning">11.4. Decision theory and “winning”</a></p></li></ul>
</p></li></ul>
</div>
<h2><a href="#what-is-decision-theory">1. What is decision theory?</a></h2>
<p><em>Decision theory</em>, also known as <em>rational choice theory</em>, concerns the study of preferences, uncertainties, and other issues related to making “optimal” or “rational” choices. It has been discussed by economists, psychologists, philosophers, mathematicians, statisticians, and computer scientists.</p>
<p>We can divide decision theory into three parts (<a href="http://www.owlnet.rice.edu/~econ501/lectures/Decision_EU.pdf">Grant & Zandt 2009</a>; <a href="http://www.amazon.com/dp/0521680433/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Baron 2008</a>). <em>Normative</em> decision theory studies what an ideal agent (a perfectly rational agent, with infinite computing power, etc.) would choose. <em>Descriptive</em> decision theory studies how non-ideal agents (e.g. humans) <em>actually</em> choose. <em>Prescriptive</em> decision theory studies how non-ideal agents can improve their decision-making (relative to the normative model) despite their imperfections.</p>
<p>For example, one’s <em>normative</em> model might be <a href="http://kleene.ss.uci.edu/lpswiki/index.php/Expected_Utility_Theory">expected utility theory</a>, which says that a rational agent chooses the action with the highest expected utility. Replicated results in psychology <em>describe</em> humans repeatedly <em>failing</em> to maximize expected utility in particular, <a href="http://www.amazon.com/Predictably-Irrational-Revised-Expanded-Edition/dp/0061353248/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">predictable</a> ways: for example, they make some choices based not on potential future benefits but on irrelevant past efforts (the “<a href="http://en.wikipedia.org/wiki/Sunk_costs">sunk cost fallacy</a>”). To help people avoid this error, some theorists <em>prescribe</em> basic training in microeconomics, which has been shown to reduce the likelihood that humans will commit the sunk cost fallacy (<a href="http://commonsenseatheism.com/wp-content/uploads/2012/08/Larrick-et-al-Teaching-the-use-of-cost-benefit-reasoning-in-everyday-life.pdf">Larrick et al. 1990</a>). Thus, through a coordination of normative, descriptive, and prescriptive research we can help agents to succeed in life by acting more in accordance with the normative model than they otherwise would.</p>
<p>This FAQ focuses on normative decision theory. Good sources on descriptive and prescriptive decision theory include <a href="http://www.amazon.com/Rationality-Reflective-Mind-Keith-Stanovich/dp/0195341147/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Stanovich (2010)</a> and <a href="http://www.amazon.com/Rational-Choice-Uncertain-World-Psychology/dp/1412959039/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Hastie & Dawes (2009)</a>.</p>
<p>Two related fields beyond the scope of this FAQ are <a href="http://en.wikipedia.org/wiki/Game_theory">game theory</a> and <a href="http://en.wikipedia.org/wiki/Social_choice_theory">social choice theory</a>. Game theory is the study of conflict and cooperation among multiple decision makers, and is thus sometimes called “interactive decision theory.” Social choice theory is the study of making a collective decision by combining the preferences of multiple decision makers in various ways.</p>
<p>This FAQ draws heavily from two textbooks on decision theory: <a href="http://www.amazon.com/Choices-An-Introduction-Decision-Theory/dp/0816614407/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Resnik (1987)</a> and <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009)</a>. It also draws from more recent results in decision theory, published in journals such as <em><a href="http://www.springerlink.com/content/0039-7857">Synthese</a></em> and <em><a href="http://www.springerlink.com/content/0040-5833">Theory and Decision</a></em>.</p>
<h2 id="is-the-rational-decision-always-the-right-decision"><a href="#is-the-rational-decision-always-the-right-decision">2. Is the rational decision always the right decision?</a></h2>
<p>No. Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 1) explains:</p>
<blockquote>
<p>[In 1700], King Carl of Sweden and his 8,000 troops attacked the Russian army [which] had about ten times as many troops… Most historians agree that the Swedish attack was irrational, since it was almost certain to fail… However, because of an unexpected blizzard that blinded the Russian army, the Swedes won...</p>
</blockquote>
<blockquote>
<p>Looking back, the Swedes’ decision to attack the Russian army was no doubt right, since the <em>actual outcome</em> turned out to be success. However, since the Swedes had no <em>good reason</em> for expecting that they were going to win, the decision was nevertheless irrational.</p>
</blockquote>
<blockquote>
<p>More generally speaking, we say that a decision is <em>right</em> if and only if its actual outcome is at least as good as that of every other possible outcome. Furthermore, we say that a decision is <em>rational</em> if and only if the decision maker [<em>aka</em> the “agent”] chooses to do what she has most reason to do at the point in time at which the decision is made.</p>
</blockquote>
<p>Unfortunately, we cannot know with certainty what the right decision is. Thus, the best we can do is to try to make “rational” or “optimal” decisions based on our preferences and incomplete information.</p>
<h2 id="how-can-i-better-understand-a-decision-problem"><a href="#how-can-i-better-understand-a-decision-problem">3. How can I better understand a decision problem?</a></h2>
<p>First, we must <em>formalize</em> a decision problem. It usually helps to <em>visualize</em> the decision problem, too.</p>
<p>In decision theory, decision rules are only defined relative to a formalization of a given decision problem, and a formalization of a decision problem can be visualized in multiple ways. Here is an example from Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 2):</p>
<blockquote>
<p>Suppose… that you are thinking about taking out fire insurance on your home. Perhaps it costs $100 to take out insurance on a house worth $100,000, and you ask: Is it worth it?</p>
</blockquote>
<p>The most common way to formalize a decision problem is to break it into states, acts, and outcomes. When facing a decision problem, the decision maker aims to choose the <em>act</em> that will have the best <em>outcome</em>. But the outcome of each act depends on the <em>state</em> of the world, which is unknown to the decision maker.</p>
<p>In this framework, speaking loosely, a state is a part of the world that is neither an act (something the decision maker can perform now) nor an outcome (what, precisely, states are is a complex question beyond the scope of this document). Luckily, not all states are relevant to a particular decision problem. We only need to take into account states that affect the agent’s preference among acts. A simple formalization of the fire insurance problem might include only two states: the state in which your house doesn’t (later) catch on fire, and the state in which your house <em>does</em> (later) catch on fire.</p>
<p>Presumably, the agent prefers some outcomes to others. Suppose the four conceivable outcomes in the above decision problem are: (1) House and $0, (2) House and -$100, (3) No house and $99,900, and (4) No house and $0. In this case, the decision maker might prefer outcome 1 over outcome 2, outcome 2 over outcome 3, and outcome 3 over outcome 4. (We’ll discuss measures of value for outcomes in the next section.)</p>
<p>An act is commonly taken to be a function that takes one set of the possible states of the world as input and gives a particular outcome as output. For the above decision problem we could say that if the act “Take out insurance” has the world-state “Fire” as its input, then it will give the outcome “No house and $99,900” as its output.</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/DhCAW.jpg" alt="An outline of the states, acts and outcomes in the insurance case" loading="lazy"></div>
<p class="caption">An outline of the states, acts and outcomes in the insurance case</p>
</div>
<p>Note that decision theory is concerned with <em>particular</em> acts rather than <em>generic</em> acts, e.g. “sailing west in 1492” rather than “sailing.” Moreover, the acts of a decision problem must be <em>alternative</em> acts, so that the decision maker has to choose exactly <em>one</em> act.</p>
<p>Once a decision problem has been formalized, it can then be visualized in any of several ways.</p>
<p>One way to visualize this decision problem is to use a <em>decision matrix</em>:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td><em>Fire</em></td>
<td><em>No fire</em></td>
</tr>
<tr>
<td><em>Take out insurance</em></td>
<td>No house and $99,900</td>
<td>House and -$100</td>
</tr>
<tr>
<td><em>No insurance</em></td>
<td>No house and $0</td>
<td>House and $0</td>
</tr>
</tbody>
</table>
<p>Another way to visualize this problem is to use a <em>decision tree</em>:</p>
<p class="imgonly"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1608693326/basic-decision-tree_ohriwo.gif" alt="" loading="lazy"></p>
<p>The square is a <em>choice node</em>, the circles are <em>chance nodes</em>, and the triangles are <em>terminal nodes</em>. At the choice node, the decision maker chooses which branch of the decision tree to take. At the chance nodes, <em>nature</em> decides which branch to follow. The triangles represent outcomes.</p>
<p>Of course, we could add more branches to each choice node and each chance node. We could also add more choice nodes, in which case we are representing a <em>sequential</em> decision problem. Finally, we could add probabilities to each branch, as long as the probabilities of all the branches extending from each single node sum to 1. And because a decision tree obeys the laws of probability theory, we can calculate the probability of any given node by multiplying the probabilities of all the branches preceding it.</p>
<p>Our decision problem could also be represented as a <em>vector</em> — an ordered list of mathematical objects that is perhaps most suitable for computers:</p>
<blockquote>
<p>[<br> [a<sub>1</sub> = take out insurance,<br> a<sub>2</sub> = do not];<br> [s<sub>1</sub> = fire,<br> s<sub>2</sub> = no fire];<br> [(a<sub>1</sub>, s<sub>1</sub>) = No house and $99,900,<br> (a<sub>1</sub>, s<sub>2</sub>) = House and -$100,<br> (a<sub>2</sub>, s<sub>1</sub>) = No house and $0,<br> (a<sub>2</sub>, s<sub>2</sub>) = House and $0]<br> ]</p>
</blockquote>
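<p>This vector form maps directly onto a data structure. A minimal Python sketch (the variable names are illustrative), which also makes the earlier point literal that an act is a function from states to outcomes:</p>

```python
# The insurance problem as plain data: acts, states, and an outcome
# lookup on (act, state) pairs, mirroring the vector above.
acts = ["take out insurance", "no insurance"]
states = ["fire", "no fire"]
outcomes = {
    ("take out insurance", "fire"): "No house and $99,900",
    ("take out insurance", "no fire"): "House and -$100",
    ("no insurance", "fire"): "No house and $0",
    ("no insurance", "no fire"): "House and $0",
}

# An act, formally, is the function from states to outcomes:
def act_function(act):
    return {s: outcomes[(act, s)] for s in states}

print(act_function("take out insurance"))
```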
<p>For more details on formalizing and visualizing decision problems, see <a href="http://www.amazon.com/Introduction-Decision-Analysis-3rd-Edition/dp/0964793865/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Skinner (1993)</a>.</p>
<h2 id="how-can-i-measure-an-agents-preferences"><a href="#how-can-i-measure-an-agents-preferences">4. How can I measure an agent’s preferences?</a></h2>
<h3 id="the-concept-of-utility"><a href="#the-concept-of-utility">4.1. The concept of utility</a></h3>
<p>It is important not to measure an agent’s preferences in terms of <em>objective</em> value, e.g. monetary value. To see why, consider the absurdities that can result when we try to measure an agent’s preference with money alone.</p>
<p>Suppose you may choose between (A) receiving a million dollars <em>for sure</em>, and (B) a 50% chance of winning either $3 million or nothing. The <em>expected monetary value</em> (EMV) of your act is computed by multiplying the monetary value of each possible outcome by its probability. So, the EMV of choice A is (1)($1 million) = $1 million. The EMV of choice B is (0.5)($3 million) + (0.5)($0) = $1.5 million. Choice B has a higher expected monetary value, and yet many people would prefer the guaranteed million.</p>
<p>Why? For many people, the difference between having $0 and $1 million is <em>subjectively</em> much larger than the difference between having $1 million and $3 million, even if the latter difference is larger in dollars.</p>
<p>To capture an agent’s <em>subjective</em> preferences, we use the concept of <em>utility</em>. A <em>utility function</em> assigns numbers to outcomes such that outcomes with higher numbers are preferred to outcomes with lower numbers. For example, for a particular decision maker — say, one who has no money — the utility of $0 might be 0, the utility of $1 million might be 1000, and the utility of $3 million might be 1500. Thus, the <em>expected utility</em> (EU) of choice A is, for this decision maker, (1)(1000) = 1000. Meanwhile, the EU of choice B is (0.5)(1500) + (0.5)(0) = 750. In this case, the expected utility of choice A is greater than that of choice B, even though choice B has a greater expected monetary value.</p>
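<p>A short computation confirms both expectations, using the illustrative utilities from the text:</p>

```python
# Expected monetary value vs. expected utility for the two gambles,
# with the (illustrative) utility assignments from the text.
utility = {0: 0, 1_000_000: 1000, 3_000_000: 1500}

choice_a = [(1.0, 1_000_000)]               # $1M for sure
choice_b = [(0.5, 3_000_000), (0.5, 0)]     # coin flip on $3M or nothing

def emv(lottery):
    return sum(p * x for p, x in lottery)

def eu(lottery):
    return sum(p * utility[x] for p, x in lottery)

print(emv(choice_a), emv(choice_b))   # B wins on money: 1.0M vs 1.5M
print(eu(choice_a), eu(choice_b))     # A wins on utility: 1000 vs 750
```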
<p>Note that those from the field of statistics who work on decision theory tend to talk about a “loss function,” which is simply a utility function with its sign flipped: losses are minimized where utilities would be maximized. For an overview of decision theory from this perspective, see <a href="http://www.amazon.com/Statistical-Decision-Bayesian-Analysis-Statistics/dp/1441930744/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Berger (1985)</a> and <a href="http://www.amazon.com/Bayesian-Choice-Decision-Theoretic-Computational-Implementation/dp/0387715983/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Robert (2001)</a>. For a critique of some standard results in statistical decision theory, see <a href="http://www.amazon.com/Probability-Theory-The-Logic-Science/dp/0521592712/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Jaynes (2003, ch. 13)</a>.</p>
<h3 id="types-of-utility"><a href="#types-of-utility">4.2. Types of utility</a></h3>
<p>An agent’s utility function can’t be directly observed, so it must be constructed — e.g. by asking them which options they prefer for a large set of pairs of alternatives (as on <a href="http://www.whoishotter.com">WhoIsHotter.com</a>). The number that corresponds to an outcome’s utility can convey different information depending on the <em>utility scale</em> in use, and the utility scale in use depends on how the utility function is constructed.</p>
<p>Decision theorists distinguish three kinds of utility scales:</p>
<ol style=""><li>
<p>Ordinal scales (“12 is better than 6”). In an ordinal scale, preferred outcomes are assigned higher numbers, but the numbers don’t tell us anything about the differences or ratios between the utility of different outcomes.</p>
</li><li>
<p>Interval scales (“the difference between 12 and 6 equals that between 6 and 0”). An interval scale gives us more information than an ordinal scale. Not only are preferred outcomes assigned higher numbers, but also the numbers accurately reflect the <em>difference</em> between the utility of different outcomes. They do not, however, necessarily reflect the ratios of utility between different outcomes. If outcome A has utility 0, outcome B has utility 6, and outcome C has utility 12 on an interval scale, then we know that the difference in utility between outcomes A and B and between outcomes B and C is the same, but we can’t know whether outcome B is “twice as good” as outcome A.</p>
</li><li>
<p>Ratio scales (“12 is exactly <em>twice</em> as valuable as 6”). Numerical utility assignments on a ratio scale give us the most information of all. They accurately reflect preference rankings, differences, <em>and</em> ratios. Thus, we can say that an outcome with utility 12 is exactly <em>twice</em> as valuable to the agent in question as an outcome with utility 6.</p>
</li></ol>
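<p>The difference between interval and ratio scales can be demonstrated in a few lines: any positive affine transformation of an interval-scale utility function carries the same information, but it does not preserve ratios. (The numbers are the A/B/C example above.)</p>

```python
# Interval-scale utilities carry ordering and differences, but not ratios:
# any positive affine transformation u' = a*u + b (with a > 0) represents
# the same preferences.
u = {"A": 0, "B": 6, "C": 12}
u2 = {k: 2 * v + 5 for k, v in u.items()}   # same interval-scale info

# Ordering and equality of differences survive the transformation...
assert (u["C"] > u["B"]) == (u2["C"] > u2["B"])
assert (u["B"] - u["A"] == u["C"] - u["B"])
assert (u2["B"] - u2["A"] == u2["C"] - u2["B"])

# ...but ratios do not, which is why "C is twice as good as B" is
# meaningless on an interval scale.
print(u["C"] / u["B"], u2["C"] / u2["B"])   # 2.0 vs. about 1.71
```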
<p>Note that neither <em>experienced utility</em> (happiness) nor the notions of “average utility” or “total utility” discussed by utilitarian moral philosophers are the same thing as the <em>decision utility</em> that we are discussing now to describe decision preferences. As the situation merits, we can be even more specific. For example, when discussing the type of decision utility used in an interval scale utility function constructed using Von Neumann & Morgenstern’s axiomatic approach (see section 8), some people use the term <em>VNM-utility</em>.</p>
<p>Now that you know that an agent’s preferences can be represented as a “utility function,” and that assignments of utility to outcomes can mean different things depending on the utility scale of the utility function, we are ready to think more formally about the challenge of making “optimal” or “rational” choices. (We will return to the problem of constructing an agent’s utility function later, in section 8.3.)</p>
<h2 id="what-do-decision-theorists-mean-by-risk-ignorance-and-uncertainty"><a href="#what-do-decision-theorists-mean-by-risk-ignorance-and-uncertainty">5. What do decision theorists mean by “risk,” “ignorance,” and “uncertainty”?</a></h2>
<p>Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 1) explains:</p>
<blockquote>
<p>In decision theory, everyday terms such as <em>risk</em>, <em>ignorance</em>, and <em>uncertainty</em> are used as technical terms with precise meanings. In decisions under risk the decision maker knows the probability of the possible outcomes, whereas in decisions under ignorance the probabilities are either unknown or non-existent. Uncertainty is either used as a synonym for ignorance, or as a broader term referring to both risk and ignorance.</p>
</blockquote>
<p>In this FAQ, a “decision under ignorance” is one in which probabilities are <em>not</em> assigned to all outcomes, and a “decision under uncertainty” is one in which probabilities <em>are</em> assigned to all outcomes. The term “risk” will be reserved for discussions related to utility.</p>
<h2 id="how-should-i-make-decisions-under-ignorance"><a href="#how-should-i-make-decisions-under-ignorance">6. How should I make decisions under ignorance?</a></h2>
<p>A decision maker faces a “decision under ignorance” when she (1) knows which acts she could choose and which outcomes they may result in, but (2) is unable to assign probabilities to the outcomes.</p>
<p>(Note that many theorists think that all decisions under ignorance can be transformed into decisions under uncertainty, in which case this section will be irrelevant except for subsection 6.1. For details, see section 7.)</p>
<h3 id="the-dominance-principle"><a href="#the-dominance-principle">6.1. The dominance principle</a></h3>
<p>To borrow an example from Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 3), suppose that Jane isn’t sure whether to order hamburger or monkfish at a new restaurant. Just about any chef can make an edible hamburger, and she knows that monkfish is fantastic if prepared by a world-class chef, but she also recalls that monkfish is difficult to cook. Unfortunately, she knows too little about this restaurant to assign any probability to the prospect of getting good monkfish. Her decision matrix might look like this:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td><em>Good chef</em></td>
<td><em>Bad chef</em></td>
</tr>
<tr>
<td><em>Monkfish</em></td>
<td>good monkfish</td>
<td>terrible monkfish</td>
</tr>
<tr>
<td><em>Hamburger</em></td>
<td>edible hamburger</td>
<td>edible hamburger</td>
</tr>
<tr>
<td><em>No main course</em></td>
<td>hungry</td>
<td>hungry</td>
</tr>
</tbody>
</table>
<p>Here, decision theorists would say that the “hamburger” choice <em>dominates</em> the “no main course” choice. This is because choosing the hamburger leads to a better outcome for Jane no matter which possible state of the world (good chef or bad chef) turns out to be true.</p>
<p>This <em>dominance principle</em> comes in two forms:</p>
<ul><li><p><em>Weak dominance</em>: One act is <em>more</em> rational than another if (1) all its possible outcomes are at least as good as those of the other, and if (2) there is at least one possible outcome that is better than that of the other act.</p></li><li><p><em>Strong dominance</em>: One act is <em>more</em> rational than another if all of its possible outcomes are better than those of the other act.</p></li></ul>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/7fU6U.jpg" alt="A comparison of strong and weak dominance" loading="lazy"></div>
<p class="caption">A comparison of strong and weak dominance</p>
</div>
<p>The dominance principle can also be applied to decisions under uncertainty (in which probabilities <em>are</em> assigned to all the outcomes). If we assign probabilities to outcomes, it is still rational to choose one act over another act if all its outcomes are at least as good as the outcomes of the other act.</p>
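<p>Both forms of dominance are easy to state as code. Here is a sketch using illustrative numeric utilities for Jane’s restaurant problem (the specific numbers are assumptions; only their ordering matters):</p>

```python
# Weak and strong dominance, with outcomes ranked by a numeric utility.
# Each act maps state -> utility of the resulting outcome.
def weakly_dominates(a, b, states):
    return (all(a[s] >= b[s] for s in states)
            and any(a[s] > b[s] for s in states))

def strongly_dominates(a, b, states):
    return all(a[s] > b[s] for s in states)

# Jane's restaurant problem with illustrative utilities:
states = ["good chef", "bad chef"]
hamburger = {"good chef": 2, "bad chef": 2}
no_main = {"good chef": 0, "bad chef": 0}
monkfish = {"good chef": 3, "bad chef": -1}

print(strongly_dominates(hamburger, no_main, states))  # True
print(weakly_dominates(hamburger, monkfish, states))   # False: no dominance
```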
<p>However, the dominance principle only applies (non-controversially) when the agent’s acts are independent of the state of the world. So consider the decision of whether to steal a coat:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td><em>Charged with theft</em></td>
<td><em>Not charged with theft</em></td>
</tr>
<tr>
<td><em>Theft</em></td>
<td>Jail and coat</td>
<td>Freedom and coat</td>
</tr>
<tr>
<td><em>No theft</em></td>
<td>Jail</td>
<td>Freedom</td>
</tr>
</tbody>
</table>
<p>In this case, stealing the coat dominates not doing so, but it isn’t necessarily the rational decision. After all, stealing increases your chance of being charged with theft, and might be irrational for that reason. So dominance doesn’t apply in cases like this, where the state of the world is not independent of the agent’s act.</p>
<p>On top of this, not all decision problems include an act that dominates all the others. Consequently, additional principles are often required to reach a decision.</p>
<h3 id="maximin-and-leximin"><a href="#maximin-and-leximin">6.2. Maximin and leximin</a></h3>
<p>Some decision theorists have suggested the <em>maximin principle</em>: if the worst possible outcome of one act is better than the worst possible outcome of another act, then the former act should be chosen. In Jane’s decision problem above, the maximin principle would prescribe choosing the hamburger, because the worst possible outcome of choosing the hamburger (“edible hamburger”) is better than the worst possible outcome of choosing the monkfish (“terrible monkfish”) and is also better than the worst possible outcome of eating no main course (“hungry”).</p>
<p>If the worst outcomes of two or more acts are equally good, the maximin principle tells you to be indifferent between them. But that doesn’t seem right. For this reason, fans of the maximin principle often invoke the <em>lexical</em> maximin principle (“leximin”), which says that if the worst outcomes of two or more acts are equally good, one should choose the act for which the <em>second worst</em> outcome is best. (If that doesn’t single out a single act, then the <em>third worst</em> outcome should be considered, and so on.)</p>
<p>Why adopt the leximin principle? Advocates point out that the leximin principle transforms a decision problem under ignorance into a decision problem under partial certainty. The decision maker doesn’t know what the outcome will be, but they know what the worst possible outcome will be.</p>
<p>But in some cases, the leximin rule seems clearly irrational. Imagine this decision problem, with two possible acts and two possible states of the world:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td class="numeric">s<sub>1</sub></td>
<td class="numeric">s<sub>2</sub></td>
</tr>
<tr>
<td class="numeric">a<sub>1</sub></td>
<td class="numeric">$1</td>
<td class="numeric">$10,001.01</td>
</tr>
<tr>
<td class="numeric">a<sub>2</sub></td>
<td class="numeric">$1.01</td>
<td class="numeric">$1.01</td>
</tr>
</tbody>
</table>
<p>In this situation, the leximin principle prescribes choosing a<sub>2</sub>. But most people would agree it is rational to risk losing out on a single cent for the chance to get an extra $10,000.</p>
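<p>The leximin computation is simple enough to sketch in a few lines of Python (a minimal illustration, not from the original text; the function names and the list representation of acts are my own):</p>

```python
# Minimal sketch of the maximin and leximin rules. Each act is represented
# as a list of outcome values, one per state of the world.

def maximin(acts):
    """Choose the act whose worst outcome is best."""
    return max(acts, key=lambda outcomes: min(outcomes))

def leximin(acts):
    """Like maximin, but break ties by the second-worst outcome, then the
    third-worst, and so on: sort ascending and compare lexicographically."""
    return max(acts, key=lambda outcomes: sorted(outcomes))

# The decision problem above: a1 pays $1 or $10,001.01; a2 pays $1.01 either way.
a1 = [1.00, 10001.01]
a2 = [1.01, 1.01]

print(leximin([a1, a2]))  # [1.01, 1.01] — leximin picks a2, despite a1's huge upside
```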
<h3 id="maximax-and-optimism-pessimism"><a href="#maximax-and-optimism-pessimism">6.3. Maximax and optimism-pessimism</a></h3>
<p>The maximin and leximin rules focus their attention on the worst possible outcomes of a decision, but why not focus on the <em>best</em> possible outcome? The <em>maximax principle</em> prescribes that if the best possible outcome of one act is better than the best possible outcome of another act, then the former act should be chosen.</p>
<p>More popular among decision theorists is the <em>optimism-pessimism rule</em> (<em>aka</em> the <em>alpha-index rule</em>). The optimism-pessimism rule prescribes that one consider both the best and worst possible outcome of each possible act, and then choose according to one’s degree of optimism or pessimism.</p>
<p>Here’s an example from Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 3):</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td class="numeric">s<sub>1</sub></td>
<td class="numeric">s<sub>2</sub></td>
<td class="numeric">s<sub>3</sub></td>
<td class="numeric">s<sub>4</sub></td>
<td class="numeric">s<sub>5</sub></td>
<td class="numeric">s<sub>6</sub></td>
</tr>
<tr>
<td class="numeric">a<sub>1</sub></td>
<td class="numeric">55</td>
<td class="numeric">18</td>
<td class="numeric">28</td>
<td class="numeric">10</td>
<td class="numeric">36</td>
<td class="numeric">100</td>
</tr>
<tr>
<td class="numeric">a<sub>2</sub></td>
<td class="numeric">50</td>
<td class="numeric">87</td>
<td class="numeric">55</td>
<td class="numeric">90</td>
<td class="numeric">75</td>
<td class="numeric">70</td>
</tr>
</tbody>
</table>
<p>We represent the decision maker’s level of optimism on a scale of 0 to 1, where 0 is maximal pessimism and 1 is maximal optimism. For a<sub>1</sub>, the worst possible outcome is 10 and the best possible outcome is 100. That is, min(a<sub>1</sub>) = 10 and max(a<sub>1</sub>) = 100. So if the decision maker is 0.85 optimistic, then the total value of a<sub>1</sub> is (0.85)(100) + (1 − 0.85)(10) = 86.5, and the total value of a<sub>2</sub> is (0.85)(90) + (1 − 0.85)(50) = 84. In this situation, the optimism-pessimism rule prescribes action a<sub>1</sub>.</p>
<p>If the decision maker’s optimism is 0, then the optimism-pessimism rule collapses into the maximin rule because (0)(max(a<sub>i</sub>)) + (1 − 0)(min(a<sub>i</sub>)) = min(a<sub>i</sub>). And if the decision maker’s optimism is 1, then the optimism-pessimism rule collapses into the maximax rule. Thus, the optimism-pessimism rule turns out to be a generalization of the maximin and maximax rules. (Well, sort of. The maximin and maximax principles require only that we measure value on an ordinal scale, whereas the optimism-pessimism rule requires that we measure value on an interval scale.)</p>
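<p>The alpha-index calculation above is easy to make concrete (a sketch using the numbers from Peterson’s example; the function name is my own):</p>

```python
# Optimism-pessimism (alpha-index) rule: score each act as
#   alpha * (best outcome) + (1 - alpha) * (worst outcome),
# where alpha in [0, 1] is the decision maker's degree of optimism.

def alpha_index(outcomes, alpha):
    return alpha * max(outcomes) + (1 - alpha) * min(outcomes)

a1 = [55, 18, 28, 10, 36, 100]
a2 = [50, 87, 55, 90, 75, 70]

# With optimism 0.85: a1 scores 86.5 and a2 scores 84, so a1 is chosen.
print(alpha_index(a1, 0.85), alpha_index(a2, 0.85))

# alpha = 0 recovers maximin, and alpha = 1 recovers maximax.
assert alpha_index(a1, 0) == min(a1) and alpha_index(a1, 1) == max(a1)
```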
<p>The optimism-pessimism rule pays attention to both the best-case and worst-case scenarios, but is it rational to ignore all the outcomes in between? Consider this example:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td class="numeric">s<sub>1</sub></td>
<td class="numeric">s<sub>2</sub></td>
<td class="numeric">s<sub>3</sub></td>
</tr>
<tr>
<td class="numeric">a<sub>1</sub></td>
<td class="numeric">1</td>
<td class="numeric">2</td>
<td class="numeric">100</td>
</tr>
<tr>
<td class="numeric">a<sub>2</sub></td>
<td class="numeric">1</td>
<td class="numeric">99</td>
<td class="numeric">100</td>
</tr>
</tbody>
</table>
<p>The maximum and minimum values for a<sub>1</sub> and a<sub>2</sub> are the same, so for every degree of optimism both acts are equally good. But it seems obvious that one should choose a<sub>2</sub>.</p>
<h3 id="other-decision-principles"><a href="#other-decision-principles">6.4. Other decision principles</a></h3>
<p>Many other decision principles for dealing with decisions under ignorance have been proposed, including <a href="http://teaching.ust.hk/~bee/papers/misc/Regret%20Theory%20An%20Alternative%20Theory%20of%20Rational%20Choice%20Under%20Uncertainty.pdf">minimax regret</a>, <a href="http://www.amazon.com/Info-Gap-Decision-Theory-Second-Edition/dp/0123735521/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">info-gap</a>, and <a href="http://www.existential-risk.org/concept.pdf">maxipok</a>. For more details on making decisions under ignorance, see <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009)</a> and <a href="http://www.dss.dpem.tuc.gr/pdf/Choice%20under%20complete%20uncertainty%20-%20axiomatic%20characterizati.pdf">Bossert et al. (2000)</a>.</p>
<p>One queer feature of the decision principles discussed in this section is that they willfully disregard some information relevant to making a decision. Such a move could make sense when trying to find a decision algorithm that performs well under tight limits on available computation (<a href="http://www.dss.dpem.tuc.gr/pdf/An%20axiomatic%20treatment%20of%20three%20qualitative%20decision%20criteri.pdf">Brafman & Tennenholtz (2000)</a>), but it’s unclear why an <em>ideal</em> agent with infinite computing power (fit for a <em>normative</em> rather than a <em>prescriptive</em> theory) should willfully disregard information.</p>
<h2 id="can-decisions-under-ignorance-be-transformed-into-decisions-under-uncertainty"><a href="#can-decisions-under-ignorance-be-transformed-into-decisions-under-uncertainty">7. Can decisions under ignorance be transformed into decisions under uncertainty?</a></h2>
<p>Can decisions under ignorance be transformed into decisions under uncertainty? This would simplify things greatly, because there is near-universal agreement that decisions under uncertainty should be handled by “maximizing expected utility” (see section 11 for clarifications), whereas decision theorists still debate what should be done about decisions under ignorance.</p>
<p>For <a href="http://en.wikipedia.org/wiki/Bayesian_probability">Bayesians</a> (see section 10), <em>all</em> decisions under ignorance are transformed into decisions under uncertainty (<a href="http://www.amazon.com/Introduction-Bayesian-Inference-Decision-Edition/dp/0964793849/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Winkler 2003</a>, ch. 5) when the decision maker assigns an “ignorance prior” to each outcome for which they don’t know how to assign a probability. (Another way of saying this is to say that a Bayesian decision maker never faces a decision under ignorance, because a Bayesian must always assign a prior probability to events.) One must then consider how to assign priors, an important debate among Bayesians (see section 10).</p>
<p>Many non-Bayesian decision theorists also think that decisions under ignorance can be transformed into decisions under uncertainty due to something called the <em>principle of insufficient reason</em>. The principle of insufficient reason prescribes that if you have literally <em>no</em> reason to think that one state is more probable than another, then you should assign <em>equal</em> probability to both states.</p>
<p>One objection to the principle of insufficient reason is that it is very sensitive to how states are individuated. Peterson (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2009</a>, ch. 3) explains:</p>
<blockquote>
<p>Suppose that before embarking on a trip you consider whether to bring an umbrella or not. [But] you know nothing about the weather at your destination. If the formalization of the decision problem is taken to include only two states, viz. rain and no rain, [then by the principle of insufficient reason] the probability of each state will be <span class="frac"><sup>1</sup>⁄<sub>2</sub></span>. However, it seems that one might just as well go for a formalization that divides the space of possibilities into three states, viz. heavy rain, moderate rain, and no rain. If the principle of insufficient reason is applied to the latter set of states, their probabilities will be <span class="frac"><sup>1</sup>⁄<sub>3</sub></span>. In some cases this difference will affect our decisions. Hence, it seems that anyone advocating the principle of insufficient reason must [defend] the rather implausible hypothesis that there is only one correct way of making up the set of states.</p>
</blockquote>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/kXn03.jpg" alt="An objection to the principle of insufficient reason" loading="lazy"></div>
<p class="caption">An objection to the principle of insufficient reason</p>
</div>
<p>Advocates of the principle of insufficient reason might respond that one must consider <em>symmetric</em> states. For example if someone gives you a die with <em>n</em> sides and you have no reason to think the die is biased, then you should assign a probability of 1/<em>n</em> to each side. But, Peterson notes:</p>
<blockquote>
<p>...not all events can be described in symmetric terms, at least not in a way that justifies the conclusion that they are equally probable. Whether Ann’s marriage will be a happy one depends on her future emotional attitude toward her husband. According to one description, she could be either in love or not in love with him; then the probability of both states would be <span class="frac"><sup>1</sup>⁄<sub>2</sub></span>. According to another equally plausible description, she could either be deeply in love, a little bit in love or not at all in love with her husband; then the probability of each state would be <span class="frac"><sup>1</sup>⁄<sub>3</sub></span>.</p>
</blockquote>
<h2 id="how-should-i-make-decisions-under-uncertainty"><a href="#how-should-i-make-decisions-under-uncertainty">8. How should I make decisions under uncertainty?</a></h2>
<p>A decision maker faces a “decision under uncertainty” when she (1) knows which acts she could choose and which outcomes they may result in, and she (2) assigns probabilities to the outcomes.</p>
<p>Decision theorists generally agree that when facing a decision under uncertainty, it is rational to choose the act with the highest expected utility. This is the principle of <em>expected utility maximization</em> (EUM).</p>
<p>Decision theorists offer two kinds of justifications for EUM. The first has to do with the law of large numbers (see section 8.1). The second has to do with the axiomatic approach (see sections 8.2 through 8.6).</p>
<h3 id="the-law-of-large-numbers"><a href="#the-law-of-large-numbers">8.1. The law of large numbers</a></h3>
<p>The “law of large numbers” states that <em>in the long run</em>, if you face the same decision problem again and again and again, and you always choose the act with the highest expected utility, then you will almost certainly be better off than if you had chosen any other acts.</p>
<p>There are two problems with using the law of large numbers to justify EUM. The first problem is that the world is ever-changing, so we rarely if ever face the same decision problem “again and again and again.” The law of large numbers says that if you face the same decision problem infinitely many times, then the probability that you could do better by not maximizing expected utility approaches zero. But you won’t ever face the same decision problem infinitely many times! Why should you care what would happen if a certain condition held, if you know that condition will never hold?</p>
<p>The second problem with using the law of large numbers to justify EUM has to do with a mathematical theorem known as <em>gambler’s ruin</em>. Imagine that you and I flip a fair coin, and I pay you $1 every time it comes up heads and you pay me $1 every time it comes up tails. We both start with $100. If we flip the coin enough times, one of us will face a situation in which the sequence of heads or tails is longer than we can afford. If a long-enough sequence of heads comes up, I’ll run out of $1 bills with which to pay you. If a long-enough sequence of tails comes up, you won’t be able to pay me. So in this situation, the law of large numbers guarantees that you will be better off in the long run by maximizing expected utility only if you start the game with an infinite amount of money (so that you never go broke), which is an unrealistic assumption. (For technical convenience, assume utility increases linearly with money. But the basic point holds without this assumption.)</p>
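<p>The gambler’s-ruin point is easy to see in simulation (a sketch, not from the original text; the function name and the fixed seed are my own choices):</p>

```python
import random

# Fair $1-per-flip coin game between two players with finite bankrolls.
# Even though each flip is fair, the game eventually ends with one player
# broke — so the "long run" promised by the law of large numbers never
# arrives for the ruined player.

def play_until_ruin(bankroll_a=100, bankroll_b=100, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    flips = 0
    while bankroll_a > 0 and bankroll_b > 0:
        if rng.random() < 0.5:   # heads: A pays B $1
            bankroll_a, bankroll_b = bankroll_a - 1, bankroll_b + 1
        else:                    # tails: B pays A $1
            bankroll_a, bankroll_b = bankroll_a + 1, bankroll_b - 1
        flips += 1
    return flips

print(play_until_ruin())  # finite: the game terminated with one player broke
```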
<h3 id="the-axiomatic-approach"><a href="#the-axiomatic-approach">8.2. The axiomatic approach</a></h3>
<p>The other method for justifying EUM seeks to show that EUM can be derived from axioms that hold regardless of what happens in the long run.</p>
<p>In this section we will review perhaps the most famous axiomatic approach, from <a href="http://www.amazon.com/Economic-Behavior-Commemorative-Princeton-Editions/dp/0691130612/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Von Neumann and Morgenstern (1947)</a>. Other axiomatic approaches include <a href="http://www.amazon.com/The-Foundations-Statistics-Leonard-Savage/dp/0486623491/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Savage (1954)</a>, <a href="http://www.amazon.com/The-Logic-Decision-Richard-Jeffrey/dp/0226395820/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Jeffrey (1983)</a>, and <a href="http://pages.stern.nyu.edu/~dbackus/Exotic/1Ambiguity/AnscombeAumann%20AMS%2063.pdf">Anscombe & Aumann (1963)</a>.</p>
<h3 id="the-von-neumann-morgenstern-utility-theorem"><a href="#the-von-neumann-morgenstern-utility-theorem">8.3. The Von Neumann-Morgenstern utility theorem</a></h3>
<p>The first decision theory axiomatization appeared in an appendix to the second edition of Von Neumann & Morgenstern’s <em><a href="http://www.amazon.com/Economic-Behavior-Commemorative-Princeton-Editions/dp/0691130612/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Theory of Games and Economic Behavior</a></em> (1947). An important point to note up front is that, in this axiomatization, Von Neumann and Morgenstern take the options the agent chooses between to be not acts, as we’ve defined them, but lotteries (where a lottery is a set of outcomes, each paired with a probability). As such, while discussing their axiomatization, we will talk of lotteries. (Despite this distinction, acts and lotteries are closely related. Under the conditions of uncertainty that we are considering here, each act is associated with some lottery, and so preferences over lotteries could be used to determine preferences over acts, if so desired.)</p>
<p>The key feature of the Von Neumann and Morgenstern axiomatization is a proof that if a decision maker states her preferences over a set of lotteries, and if her preferences conform to a set of intuitive structural constraints (axioms), then we can construct a utility function (on an interval scale) from her preferences over lotteries and show that she acts <em>as if</em> she maximizes expected utility with respect to that utility function.</p>
<p>What are the axioms to which an agent’s preferences over lotteries must conform? There are four of them.</p>
<ol style=""><li>
<p>The <em>completeness axiom</em> states that the agent must <em>bother to state a preference</em> for each pair of lotteries. That is, the agent must prefer A to B, or prefer B to A, or be indifferent between the two.</p>
</li><li>
<p>The <em>transitivity axiom</em> states that if the agent prefers A to B and B to C, she must also prefer A to C.</p>
</li><li>
<p>The <em>independence axiom</em> states that, for example, if an agent prefers an apple to an orange, then she must also prefer the lottery [55% chance she gets an apple, otherwise she gets cholera] over the lottery [55% chance she gets an orange, otherwise she gets cholera]. More generally, this axiom holds that a preference must hold independently of the possibility of another outcome (e.g. cholera).</p>
</li><li>
<p>The <em>continuity axiom</em> holds that if the agent prefers A to B to C, then there exists a unique <em>p</em> (probability) such that the agent is indifferent between [<em>p</em>(A) + (1 - <em>p</em>)(C)] and [outcome B with certainty].</p>
</li></ol>
<p>The continuity axiom requires <a href="http://www.youtube.com/watch?v=hSUsiA8dhKM">more explanation</a>. Suppose that A = $1 million, B = $0, and C = Death. If <em>p</em> = 0.5, then the agent’s two lotteries under consideration for the moment are:</p>
<ol style=""><li><p>(0.5)($1M) + (1 − 0.5)(Death) [win $1M with 50% probability, die with 50% probability]</p></li><li><p>(1)($0) [win $0 with certainty]</p></li></ol>
<p>Most people would <em>not</em> be indifferent between $0 with certainty and [50% chance of $1M, 50% chance of Death] — the risk of Death is too high! But if you have continuous preferences, there is <em>some</em> probability <em>p</em> for which you’d be indifferent between these two lotteries. Perhaps <em>p</em> is very, very high:</p>
<ol style=""><li><p>(0.999999)($1M) + (1 − 0.999999)(Death) [win $1M with 99.9999% probability, die with 0.0001% probability]</p></li><li><p>(1)($0) [win $0 with certainty]</p></li></ol>
<p>Perhaps now you’d be indifferent between lottery 1 and lottery 2. Or maybe you’d be <em>more</em> willing to risk Death for the chance of winning $1M, in which case the <em>p</em> for which you’d be indifferent between lotteries 1 and 2 is lower than 0.999999. As long as there is <em>some</em> <em>p</em> at which you’d be indifferent between lotteries 1 and 2, your preferences are “continuous.”</p>
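<p>If we put utilities on an interval scale, the indifference probability can be computed directly: <em>p</em> solves <em>p</em>·u(A) + (1 − <em>p</em>)·u(C) = u(B). A sketch with made-up utility values (the numbers are illustrative assumptions, not part of the VNM framework):</p>

```python
# Solve p * u(A) + (1 - p) * u(C) = u(B) for the indifference probability p,
# assuming u(A) > u(B) > u(C). The utility numbers below are made up for
# illustration.

def indifference_p(u_a, u_b, u_c):
    return (u_b - u_c) / (u_a - u_c)

# A = $1M, B = $0, C = Death, with hypothetical utilities 1.0, 0.999, 0.0:
p = indifference_p(1.0, 0.999, 0.0)
print(p)  # 0.999 — this agent tolerates at most a 0.1% risk of Death
```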
<p>Given this setup, Von Neumann and Morgenstern proved their theorem, which states that if the agent’s preferences over lotteries obey their axioms, then:</p>
<ul><li><p>The agent’s preferences can be represented by a utility function that assigns higher utility to preferred lotteries.</p></li><li><p>The agent acts in accordance with the principle of maximizing expected utility.</p></li><li><p>All utility functions satisfying the above two conditions are “positive linear transformations” of each other. (Without going into the details: this is why VNM-utility is measured on an interval scale.)</p></li></ul>
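<p>The third bullet can be checked numerically (a sketch with made-up utilities; the helper function and values are my own):</p>

```python
# Positive linear transformations u' = a*u + b (with a > 0) preserve the
# expected-utility ranking of lotteries — this is why VNM utility is only
# fixed up to an interval scale.

def expected_utility(lottery, u):
    """lottery: list of (outcome, probability) pairs; u: outcome -> utility."""
    return sum(p * u[outcome] for outcome, p in lottery)

u = {"apple": 1.0, "orange": 0.4, "cholera": -10.0}
u_prime = {k: 3 * v + 7 for k, v in u.items()}  # a = 3, b = 7

l1 = [("apple", 0.55), ("cholera", 0.45)]
l2 = [("orange", 0.55), ("cholera", 0.45)]

# The ranking of l1 over l2 is identical under u and u':
assert (expected_utility(l1, u) > expected_utility(l2, u)) == \
       (expected_utility(l1, u_prime) > expected_utility(l2, u_prime))
```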
<h3><a href="#vnm-utility-theory-and-rationality">8.4. VNM utility theory and rationality</a></h3>
<p>An agent which conforms to the VNM axioms is sometimes said to be “VNM-rational.” But why should “VNM-rationality” constitute our notion of <em>rationality in general</em>? How could VNM’s result justify the claim that a rational agent maximizes expected utility when facing a decision under uncertainty? The argument goes like this:</p>
<ol style=""><li><p>If an agent chooses lotteries which it prefers (in decisions under uncertainty), and if its preferences conform to the VNM axioms, then it is rational. Otherwise, it is irrational.</p></li><li><p>If an agent chooses lotteries which it prefers (in decisions under uncertainty), and if its preferences conform to the VNM axioms, then it maximizes expected utility.</p></li><li><p>Therefore, a rational agent maximizes expected utility (in decisions under uncertainty).</p></li></ol>
<p>Von Neumann and Morgenstern proved premise 2, and the conclusion follows from premises 1 and 2. But why accept premise 1?</p>
<p>Few people deny that it would be irrational for an agent to choose a lottery which it does not prefer. But why is it irrational for an agent’s preferences to violate the VNM axioms? I will save that discussion for section 8.6.</p>
<h3 id="objections-to-vnm-rationality"><a href="#objections-to-vnm-rationality">8.5. Objections to VNM-rationality</a></h3>
<p>Several objections have been raised to Von Neumann and Morgenstern’s result:</p>
<ol style=""><li>
<p><em>The VNM axioms are too strong</em>. Some have argued that the VNM axioms are not self-evidently true. See section 8.6.</p>
</li><li>
<p><em>The VNM system offers no action guidance</em>. A VNM-rational decision maker cannot use VNM utility theory for action guidance, because she must state her preferences over lotteries at the start. But if an agent can state her preferences over lotteries, then she already knows which lottery to choose. (For more on this, see section 9.)</p>
</li><li>
<p><em>In the VNM system, utility is defined via preferences over lotteries rather than preferences over outcomes</em>. To many, it seems odd to <em>define</em> utility with respect to preferences over lotteries. Many would argue that utility should be defined in relation to preferences over <em>outcomes</em> or <em>world-states</em>, and that’s not what the VNM system does. (Also see section 9.)</p>
</li></ol>
<h3><a href="#should-we-accept-the-vnm-axioms">8.6. Should we accept the VNM axioms?</a></h3>
<p>The VNM preference axioms define what it is for an agent to be VNM-rational. But why should we accept these axioms? Usually, it is argued that each of the axioms is <em>pragmatically justified</em>: an agent who violates the axioms can face situations in which they are guaranteed to end up worse off (from <em>their own</em> perspective).</p>
<p>In sections 8.6.1 and 8.6.2 I go into some detail about pragmatic justifications offered for the transitivity and completeness axioms. For more detail, including arguments about the justification of the other axioms, see <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 8)</a> and <a href="http://www.amazon.com/Foundations-Rational-Choice-Under-Risk/dp/0198774427/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Anand (1993)</a>.</p>
<h4 id="the-transitivity-axiom"><a href="#the-transitivity-axiom">8.6.1. The transitivity axiom</a></h4>
<p>Consider the <em>money-pump argument</em> in favor of the transitivity axiom (“if the agent prefers A to B and B to C, she must also prefer A to C”).</p>
<blockquote>
<p>Imagine that a friend offers to give you exactly one of her three… novels, x or y or z… [and] that your preference ordering over the three novels is… [that] you prefer x to y, and y to z, and z to x… [That is, your preferences are <em>cyclic</em>, which is a type of <em>intransitive</em> preference relation.] Now suppose that you are in possession of z, and that you are invited to swap z for y. Since you prefer y to z, rationality obliges you to swap. So you swap, and temporarily get y. You are then invited to swap y for x, which you do, since you prefer x to y. Finally, you are offered to <em>pay a small amount</em>, say one cent, for swapping x for z. Since z is strictly [preferred to] x, even after you have paid the fee for swapping, rationality tells you that you should accept the offer. This means that you end up where you started, the only difference being that you now have one cent less. This procedure is thereafter iterated over and over again. After a billion cycles you have lost ten million dollars, for which you have got nothing in return. (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson 2009</a>, ch. 8)</p>
</blockquote>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/45csd.jpg" alt="An example of a money-pump argument" loading="lazy"></div>
<p class="caption">An example of a money-pump argument</p>
</div>
<p>Similar arguments (e.g. <a href="http://johanegustafsson.net/papers/a_money-pump_for_acyclic_intransitive_preferences.pdf">Gustafsson 2010</a>) aim to show that the other kind of intransitive preferences (acyclic preferences) are irrational, too.</p>
<p>(Of course, pragmatic arguments need not be framed in monetary terms. We could just as well construct an argument showing that an agent with intransitive preferences can be “pumped” of all their happiness, or all their moral virtue, or all their Twinkies.)</p>
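<p>Peterson’s money-pump can be walked through mechanically (a sketch; the dictionary encoding of the cyclic preferences and the function name are my own):</p>

```python
# An agent with cyclic preferences x > y > z > x can be "pumped": it swaps
# around the cycle, paying a small fee each time it arrives back where it
# started. prefers[a] is the item the agent would rather have than a.

prefers = {"z": "y", "y": "x", "x": "z"}

def pump(start="z", cycles=3, fee_per_cycle=0.01):
    item, paid = start, 0.0
    for _ in range(cycles * len(prefers)):
        item = prefers[item]       # the agent always swaps to a preferred item
        if item == start:
            paid += fee_per_cycle  # fee for the swap that completes the cycle
    return item, round(paid, 2)

print(pump(cycles=3))  # ('z', 0.03): back at the start, three cents poorer
```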
<h4 id="the-completeness-axiom"><a href="#the-completeness-axiom">8.6.2. The completeness axiom</a></h4>
<p>The completeness axiom (“the agent must prefer A to B, or prefer B to A, or be indifferent between the two”) is often attacked by saying that some goods or outcomes are incommensurable — that is, they cannot be compared. For example, must a rational agent be able to state a preference (or indifference) between money and human welfare?</p>
<p>Perhaps the completeness axiom can be justified with a pragmatic argument. If you think it is rationally permissible to swap between two incommensurable goods, then one can construct a money pump argument in favor of the completeness axiom. But if you think it is <em>not</em> rational to swap between incommensurable goods, then one cannot construct a money pump argument for the completeness axiom. (In fact, even if it is rational to swap between incommensurable goods, <a href="http://personal.rhul.ac.uk/uhte/035/incomplete%20preferences.geb.pdf">Mandler, 2005</a> has demonstrated that an agent that allows their current choices to depend on the previous ones can avoid being money pumped.)</p>
<p>And in fact, there is a popular argument <em>against</em> the completeness axiom: the “small improvement argument.” For details, see <a href="http://www.amazon.com/Incommensurability-Incomparability-Practical-Reason-Chang/dp/0674447565/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Chang (1997)</a> and <a href="https://www.msb.se/Upload/Om%20MSB/Forskning/Projektrapporter/Peterson_artiklar/Small_Improvment_Argument.pdf">Espinoza (2007)</a>.</p>
<p>Note that in <a href="http://en.wikipedia.org/wiki/Revealed_preference">revealed preference theory</a>, according to which preferences are revealed through choice behavior, there is no room for incommensurable preferences because every choice always reveals a preference relation of “better than,” “worse than,” or “equally as good as.”</p>
<p>Another proposal for dealing with the apparent incommensurability of some goods (such as money and human welfare) is the <em>multi-attribute approach</em>:</p>
<blockquote>
<p>In a multi-attribute approach, each type of attribute is measured in the unit deemed to be most suitable for that attribute. Perhaps money is the right unit to use for measuring financial costs, whereas the number of lives saved is the right unit to use for measuring human welfare. The total value of an alternative is thereafter determined by aggregating the attributes, e.g. money and lives, into an overall ranking of available alternatives...</p>
</blockquote>
<blockquote>
<p>Several criteria have been proposed for choosing among alternatives with multiple attributes… [For example,] additive criteria assign weights to each attribute, and rank alternatives according to the weighted sum calculated by multiplying the weight of each attribute with its value… [But while] it is perhaps contentious to measure the utility of very different objects on a common scale, …it seems equally contentious to assign numerical weights to attributes as suggested here....</p>
</blockquote>
<blockquote>
<p>[Now let us] consider a very general objection to multi-attribute approaches. According to this objection, there exist several equally plausible but different ways of constructing the list of attributes. Sometimes the outcome of the decision process depends on which set of attributes is chosen. (<a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson 2009</a>, ch. 8)</p>
</blockquote>
<p>For more on the multi-attribute approach, see <a href="http://www.amazon.com/Decisions-Multiple-Objectives-Preferences-Tradeoffs/dp/0521438837/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Keeney & Raiffa (1993)</a>.</p>
<h4 id="the-allais-paradox"><a href="#the-allais-paradox">8.6.3. The Allais paradox</a></h4>
<p>Having considered the transitivity and completeness axioms, we can now turn to independence (a preference holds independently of considerations of other possible outcomes). Do we have any reason to reject this axiom? Here’s one reason to think we might: in a case known as the <em>Allais paradox</em> (<a href="http://www.jstor.org/stable/1907921">Allais 1953</a>), it may seem reasonable to act in a way that contradicts independence.</p>
<p>The Allais paradox asks us to consider two decisions (this version of the paradox is based on <a href="https://www.greaterwrong.com/posts/zJZvoiwydJ5zvzTHK/the-allais-paradox">Yudkowsky (2008)</a>). The first decision involves the choice between:</p>
<p>(1A) A certain $24,000; and (1B) A <span class="frac"><sup>33</sup>⁄<sub>34</sub></span> chance of $27,000 and a <span class="frac"><sup>1</sup>⁄<sub>34</sub></span> chance of nothing.</p>
<p>The second involves the choice between:</p>
<p>(2A) A 34% chance of $24,000 and a 66% chance of nothing; and (2B) A 33% chance of $27,000 and a 67% chance of nothing.</p>
<p>Experiments have shown that many people prefer (1A) to (1B) and (2B) to (2A). However, these preferences contradict independence. Option 2A is the same as [a 34% chance of option 1A and a 66% chance of nothing], while 2B is the same as [a 34% chance of option 1B and a 66% chance of nothing]. So independence implies that anyone who prefers (1A) to (1B) must also prefer (2A) to (2B).</p>
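<p>The decomposition is easy to verify numerically (a sketch; the <code>mix</code> helper and the lottery encoding are my own):</p>

```python
# Independence check for the Allais paradox: option 2A is the compound
# lottery [34% chance of option 1A, 66% chance of nothing], and option 2B
# is [34% chance of option 1B, 66% chance of nothing].

def mix(lottery, p):
    """With probability p play `lottery`, otherwise get nothing."""
    out = {"nothing": 1 - p}
    for prize, q in lottery:
        out[prize] = out.get(prize, 0) + p * q
    return {k: round(v, 4) for k, v in out.items()}

option_1a = [(24000, 1.0)]
option_1b = [(27000, 33 / 34), ("nothing", 1 / 34)]

print(mix(option_1a, 0.34))  # {'nothing': 0.66, 24000: 0.34} — option 2A
print(mix(option_1b, 0.34))  # {'nothing': 0.67, 27000: 0.33} — option 2B
```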
<p>When this result was first uncovered, it was presented as evidence against the independence axiom. However, while the Allais paradox clearly reveals that independence fails as a <em>descriptive</em> account of choice, it’s less clear what it implies about the normative account of rational choice that we are discussing in this document. As <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521716543/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 4)</a> notes:</p>
<blockquote>
<p>[S]ince many people who have thought very hard about this example still feel that it would be rational to stick to the problematic preference pattern described above, there seems to be something wrong with the expected utility principle.</p>
</blockquote>
<p>However, Peterson then goes on to note that many people, including the statistician Leonard Savage, argue that it is people’s preferences in the Allais paradox that are in error, rather than the independence axiom. If so, then the paradox reveals the danger of relying too heavily on intuition to determine the form that normative theories of rational choice should take.</p>
<h4 id="the-ellsberg-paradox"><a href="#the-ellsberg-paradox">8.6.4. The Ellsberg paradox</a></h4>
<p>The Allais paradox is far from the only case where people fail to act in accordance with EUM. Another well-known case is the Ellsberg paradox (the following is taken from <a href="http://www.amazon.com/Choices-An-Introduction-Decision-Theory/dp/0816614407/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Resnik (1987)</a>):</p>
<blockquote>
<p>An urn contains ninety uniformly sized balls, which are randomly distributed. Thirty of the balls are yellow, the remaining sixty are red or blue. We are not told how many red (blue) balls are in the urn – except that they number anywhere from zero to sixty. Now consider the following pair of situations. In each situation a ball will be drawn and we will be offered a bet on its color. In situation A we will choose between betting that it is yellow or that it is red. In situation B we will choose between betting that it is red or blue or that it is yellow or blue.</p>
</blockquote>
<p>If we guess the correct color, we will receive a payout of $100. In the Ellsberg paradox, many people bet <em>yellow</em> in situation A and <em>red or blue</em> in situation B. Further, many people make these decisions not because they are indifferent in both situations, and so happy to choose either way, but rather because they have a strict preference to choose in this manner.</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/tZKOsHx.jpg" alt="The Ellsberg paradox" loading="lazy"></div>
<p class="caption">The Ellsberg paradox</p>
</div>
<p>However, such behavior cannot be in accordance with EUM. In order for EUM to endorse a strict preference for choosing <em>yellow</em> in situation A, the agent would have to assign a probability of more than <span class="frac"><sup>1</sup>⁄<sub>3</sub></span> to the ball selected being blue. On the other hand, in order for EUM to endorse a strict preference for choosing <em>red or blue</em> in situation B the agent would have to assign a probability of less than <span class="frac"><sup>1</sup>⁄<sub>3</sub></span> to the selected ball being blue. As such, these decisions can’t be jointly endorsed by an agent following EUM.</p>
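<p>This incompatibility can be checked mechanically. The sketch below sweeps every credence the agent could assign to blue (the $100 payoff is common to all bets, so comparing probabilities suffices); no value supports both strict preferences:</p>

```python
# Credences: P(yellow) = 1/3 is fixed; P(blue) can be anywhere in
# [0, 2/3], with P(red) = 2/3 - P(blue).
def strict_ellsberg_preferences(p_blue):
    p_yellow = 1 / 3
    p_red = 2 / 3 - p_blue
    prefers_yellow_in_a = p_yellow > p_red  # requires p_blue > 1/3
    prefers_red_or_blue_in_b = (p_red + p_blue) > (p_yellow + p_blue)
    # ...the line above requires p_blue < 1/3, contradicting situation A.
    return prefers_yellow_in_a and prefers_red_or_blue_in_b

# No single value of P(blue) supports both strict preferences at once.
assert not any(strict_ellsberg_preferences(i / 1500) for i in range(1001))
```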
<p>Those who deny that decision making under ignorance can be transformed into decision making under uncertainty have an easy response to the Ellsberg paradox: since this case involves deciding under ignorance, it is irrelevant whether people’s decisions violate EUM here, as EUM is not applicable to such situations.</p>
<p>Those who believe that EUM provides a suitable standard for choice in such situations, however, need to find some other way of responding to the paradox. As with the Allais paradox, there is some disagreement about how best to do so. Once again, however, many people, including Leonard Savage, argue that EUM reaches the right decision in this case. It is our intuitions that are flawed (see again <a href="http://www.amazon.com/Choices-An-Introduction-Decision-Theory/dp/0816614407/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Resnik (1987)</a> for a nice summary of Savage’s argument to this conclusion).</p>
<h4 id="the-st-petersburg-paradox"><a href="#the-st-petersburg-paradox">8.6.5. The St Petersburg paradox</a></h4>
<p>Another objection to the VNM approach (and to expected utility approaches generally), the <a href="http://en.wikipedia.org/wiki/St._Petersburg_paradox">St. Petersburg paradox</a>, draws on the possibility of infinite utilities. The St. Petersburg paradox is based around a game where a fair coin is tossed until it lands heads up. At this point, the agent receives a prize worth 2<sup>n</sup> utility, where <em>n</em> is equal to the number of times the coin was tossed during the game. The so-called paradox occurs because the expected utility of choosing to play this game is infinite and so, according to a standard expected utility approach, the agent should be willing to pay any finite amount to play the game. However, this seems unreasonable. Instead, it seems that the agent should only be willing to pay a relatively small amount to do so. As such, it seems that the expected utility approach gets something wrong.</p>
<p>Various responses have been suggested. Most obviously, we could say that the paradox does not apply to VNM agents, since the VNM theorem assigns real numbers to all lotteries, and infinity is not a real number. But it’s unclear whether this escapes the problem. After all, at its core, the St. Petersburg paradox is not about infinite utilities but rather about cases where expected utility approaches seem to overvalue some choice, and such cases arise even with finite utilities. For example, if we let <em>L</em> be a finite limit on utility we could consider the following scenario (from <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson, 2009, p. 85</a>):</p>
<blockquote>
<p>A fair coin is tossed until it lands heads up. The player thereafter receives a prize worth min {2<sup>n</sup> · 10<sup>-100</sup>, L} units of utility, where <em>n</em> is the number of times the coin was tossed.</p>
</blockquote>
<p>In this case, even if an extremely low value is set for <em>L</em>, it seems that paying this amount to play the game is unreasonable. After all, as Peterson notes, about nine times out of ten an agent that plays this game will win no more than 8 · 10<sup>-100</sup> utility. If paying 1 utility is, in fact, unreasonable in this case, then simply limiting an agent’s utility to some finite value doesn’t provide a defence of expected utility approaches. (Other problems abound. See <a href="https://www.greaterwrong.com/posts/a5JAiTdytou3Jg749/pascal-s-mugging-tiny-probabilities-of-vast-utilities">Yudkowsky, 2007</a> for an interesting finite problem and <a href="http://philrsss.anu.edu.au/people-defaults/alanh/papers/vexing_expectations.pdf">Nover & Hajek, 2004</a> for a particularly perplexing problem with links to the St Petersburg paradox.)</p>
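<p>Both the original game’s divergence and Peterson’s nine-times-out-of-ten figure are easy to verify numerically (a minimal sketch; the payoffs are exactly as defined above):</p>

```python
# Untruncated St. Petersburg game: term n contributes (1/2)**n * 2**n = 1
# expected utility, so the partial sums grow without bound.
partial_eu = sum((0.5 ** n) * (2 ** n) for n in range(1, 101))
assert partial_eu == 100.0  # first 100 terms already contribute 100 utility

# Peterson's truncated game pays min(2**n * 1e-100, L). The game ends by
# the third toss with probability 1/2 + 1/4 + 1/8, in which case the
# prize is at most 8e-100 utility.
p_small_prize = sum(0.5 ** n for n in (1, 2, 3))
assert p_small_prize == 0.875  # "about nine times out of ten"
```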
<p>As it stands, there is no agreement about precisely what the St Petersburg paradox reveals. Some people accept one of the various resolutions of the case and so find the paradox unconcerning. Others think the paradox reveals a serious problem for expected utility theories. Still others think the paradox is unresolved but don’t think that we should respond by abandoning expected utility theory.</p>
<h2 id="does-axiomatic-decision-theory-offer-any-action-guidance"><a href="#does-axiomatic-decision-theory-offer-any-action-guidance">9. Does axiomatic decision theory offer any action guidance?</a></h2>
<p>For the decision theories listed in section 8.2, it’s often claimed the answer is “no.” To explain this, I must first examine some differences between <em>direct</em> and <em>indirect</em> approaches to axiomatic decision theory.</p>
<p><a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 4)</a> explains:</p>
<blockquote>
<p>In the indirect approach, which is the dominant approach, the decision maker does not prefer a risky act [or lottery] to another <em>because</em> the expected utility of the former exceeds that of the latter. Instead, the decision maker is asked to state a set of preferences over a set of risky acts… Then, if the set of preferences stated by the decision maker is consistent with a small number of structural constraints (axioms), it can be shown that her decisions can be described <em>as if</em> she were choosing what to do by assigning numerical probabilities and utilities to outcomes and then maximising expected utility...</p>
</blockquote>
<blockquote>
<p>[In contrast] the direct approach seeks to generate preferences over acts from probabilities and utilities <em>directly</em> assigned to outcomes. In contrast to the indirect approach, it is not assumed that the decision maker has access to a set of preferences over acts before he starts to deliberate.</p>
</blockquote>
<p>The axiomatic decision theories listed in section 8.2 all follow the indirect approach. These theories, it might be said, cannot offer any action guidance because they require an agent to state its preferences over acts “up front.” But an agent that states its preferences over acts already knows which act it prefers, so the decision theory can’t offer any action guidance not already present in the agent’s own stated preferences over acts.</p>
<p><a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 10)</a> gives a practical example:</p>
<blockquote>
<p>For example, a forty-year-old woman seeking advice about whether to, say, divorce her husband, is likely to get very different answers from the [two approaches]. The [indirect approach] will advise the woman to first figure out what her preferences are over a very large set of risky acts, including the one she is thinking about performing, and then just make sure that all preferences are consistent with certain structural requirements. Then, as long as none of the structural requirements is violated, the woman is free to do whatever she likes, no matter what her beliefs and desires actually are… The [direct approach] will [instead] advise the woman to first assign numerical utilities and probabilities to her desires and beliefs, and then aggregate them into a decision by applying the principle of maximizing expected utility.</p>
</blockquote>
<p>Thus, it seems only the direct approach offers an agent any action guidance. But the direct approach is very recent (<a href="http://www.amazon.com/Non-Bayesian-Decision-Theory-Beliefs-Desires/dp/9048179572/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson 2008</a>; <a href="http://commonsenseatheism.com/wp-content/uploads/2012/05/Cozic-Review-of-Non-Bayesian-Decision-Theory.pdf">Cozic 2011</a>), and only time will show whether it can stand up to professional criticism.</p>
<p>Warning: Peterson’s (<a href="http://www.amazon.com/Non-Bayesian-Decision-Theory-Beliefs-Desires/dp/9048179572/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">2008</a>) direct approach is confusingly called “non-Bayesian decision theory” despite assuming Bayesian probability theory.</p>
<p>For other attempts to pull action guidance from normative decision theory, see <a href="https://www.greaterwrong.com/posts/F46jPraqp258q67nE/why-you-must-maximize-expected-utility">Fallenstein (2012)</a> and <a href="https://www.greaterwrong.com/posts/oRRpsGkCZHA3pzhvm/a-fungibility-theorem">Stiennon (2013)</a>.</p>
<h2 id="how-does-probability-theory-play-a-role-in-decision-theory"><a href="#how-does-probability-theory-play-a-role-in-decision-theory">10. How does probability theory play a role in decision theory?</a></h2>
<p>In order to calculate the expected utility of an act (or lottery), it is necessary to determine a probability for each outcome. In this section, I will explore some of the details of probability theory and its relationship to decision theory.</p>
<p>For further introductory material to probability theory, see <a href="http://www.amazon.com/Scientific-Reasoning-The-Bayesian-Approach/dp/081269578X/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Howson & Urbach (2005)</a>, <a href="http://www.amazon.com/Probability-Random-Processes-Geoffrey-Grimmett/dp/0198572220/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Grimmet & Stirzacker (2001)</a>, and <a href="http://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Koller & Friedman (2009)</a>. This section draws heavily on <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, chs. 6 & 7)</a> which provides a very clear introduction to probability in the context of decision theory.</p>
<h3 id="the-basics-of-probability-theory"><a href="#the-basics-of-probability-theory">10.1. The basics of probability theory</a></h3>
<p>Intuitively, a probability is a number between 0 and 1 that indicates how likely an event is to occur. An event with probability 0 is impossible, and an event with probability 1 is certain to occur. For probabilities between these two values, the higher the number, the more probable the event.</p>
<p>As with EUM, probability theory can be derived from a small number of simple axioms. In the case of probability there are three, named the Kolmogorov axioms after the mathematician Andrey Kolmogorov. The first states that probabilities are real numbers between 0 and 1. The second states that if a set of events is mutually exclusive and exhaustive, then the probabilities of those events sum to 1. The third states that if two events are mutually exclusive, then the probability that one or the other occurs is equal to the sum of their individual probabilities.</p>
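<p>For a finite set of outcomes, these axioms are simple to check in code. The sketch below uses hypothetical weather probabilities, not numbers from the text:</p>

```python
def satisfies_kolmogorov(dist, tol=1e-9):
    """Check a finite distribution over a mutually exclusive, exhaustive
    set of outcomes against the first two axioms; the third (additivity)
    then licenses computing any event's probability as a sum."""
    axiom1 = all(0.0 <= p <= 1.0 for p in dist.values())  # reals in [0, 1]
    axiom2 = abs(sum(dist.values()) - 1.0) < tol          # sums to 1
    return axiom1 and axiom2

def p_event(dist, outcomes):
    # Additivity (axiom 3): disjoint outcomes' probabilities simply add.
    return sum(dist[o] for o in outcomes)

weather = {'rain': 0.7, 'snow': 0.1, 'clear': 0.2}  # illustrative numbers
assert satisfies_kolmogorov(weather)
assert abs(p_event(weather, ['rain', 'snow']) - 0.8) < 1e-9
assert not satisfies_kolmogorov({'rain': 0.7, 'clear': 0.2})  # sums to 0.9
```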
<p>From these three axioms, the remainder of probability theory can be derived. In the remainder of this section, I will explore some aspects of this broader theory.</p>
<h3 id="bayes-theorem-for-updating-probabilities"><a href="#bayes-theorem-for-updating-probabilities">10.2. Bayes theorem for updating probabilities</a></h3>
<p>From the perspective of decision theory, one particularly important aspect of probability theory is the idea of a conditional probability. These represent how probable something is given a piece of information. So, for example, a conditional probability could represent how likely it is that it will be raining, conditioning on the fact that the weather forecaster predicted rain. A powerful technique for calculating conditional probabilities is Bayes theorem (see <a href="http://yudkowsky.net/rational/bayes">Yudkowsky, 2003</a> for a detailed introduction). This formula states that:</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/lTKXA.gif" alt="P(A|B)=(P(B|A)P(A))/P(B)" loading="lazy"></div>
<p class="caption">P(A|B)=(P(B|A)P(A))/P(B)</p>
</div>
<p>Bayes theorem is used to calculate the probability of some event, A, given some evidence, B. As such, this formula can be used to <em>update</em> probabilities based on new evidence. So if you are trying to predict the probability that it will rain tomorrow and someone gives you the information that the weather forecaster predicted that it will do so then this formula tells you how to calculate a new probability that it will rain based on your existing information. The initial probability in such cases (before the new evidence is taken into account) is called the <em>prior probability</em> and the result of applying Bayes theorem is a new, <em>posterior probability</em>.</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/vM0yW.jpg" alt="Using Bayes theorem to update probabilities based on the evidence provided by a weather forecast" loading="lazy"></div>
<p class="caption">Using Bayes theorem to update probabilities based on the evidence provided by a weather forecast</p>
</div>
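<p>As a worked example with hypothetical numbers (a 0.3 prior for rain, and a forecaster who predicts rain on 80% of rainy days and 15% of dry days), the update can be computed directly:</p>

```python
# Hypothetical numbers, not from the text.
p_rain = 0.3
p_forecast_given_rain = 0.80
p_forecast_given_dry = 0.15

# Law of total probability for the evidence, then Bayes theorem:
p_forecast = (p_forecast_given_rain * p_rain
              + p_forecast_given_dry * (1 - p_rain))
posterior = p_forecast_given_rain * p_rain / p_forecast
print(round(posterior, 3))  # → 0.696: the forecast raises P(rain) from 0.3
```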
<p>Bayes theorem can be seen as solving the problem of how to update prior probabilities based on new information. However, it leaves open the question of how to determine the prior probability in the first place. In some cases, there will be no obvious way to do so. One solution to this problem suggests that any reasonable prior can be selected. Given enough evidence, repeated applications of Bayes theorem will lead this prior probability to be updated to much the same posterior probability, even for people with widely different initial priors. As such, the initially selected prior is less crucial than it may at first seem.</p>
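<p>This convergence can be illustrated with a simulation. The sketch below uses the standard conjugate Beta-prior shortcut for repeated Bayesian updating on coin flips (a textbook result, not something from this document):</p>

```python
import random

random.seed(0)
true_p = 0.7  # the coin's actual bias (illustrative)
flips = [random.random() < true_p for _ in range(5000)]
heads = sum(flips)

# Conjugate Beta(a, b) prior: after `heads` successes in n flips the
# posterior mean is (a + heads) / (a + b + n).
def posterior_mean(a, b, heads, n):
    return (a + heads) / (a + b + n)

agent1 = posterior_mean(1, 1, heads, len(flips))   # flat prior
agent2 = posterior_mean(50, 5, heads, len(flips))  # prior strongly favoring heads
assert abs(agent1 - agent2) < 0.02  # near agreement after 5000 flips
```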
<h3 id="how-should-probabilities-be-interpreted"><a href="#how-should-probabilities-be-interpreted">10.3. How should probabilities be interpreted?</a></h3>
<p>There are two main views about what probabilities mean: objectivism and subjectivism. Loosely speaking, the objectivist holds that probabilities tell us something about the external world while the subjectivist holds that they tell us something about our beliefs. Most decision theorists hold a subjectivist view about probability. According to this sort of view, probabilities represent subjective degrees of belief. So to say the probability of rain is 0.8 is to say that the agent under consideration has a high degree of belief that it will rain (see <a href="http://www.amazon.com/Probability-Theory-The-Logic-Science/dp/0521592712/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Jaynes, 2003</a> for a defense of this view). Note that, according to this view, another agent in the same circumstance could assign a different probability that it will rain.</p>
<h4 id="why-should-degrees-of-belief-following-the-laws-of-probability"><a href="#why-should-degrees-of-belief-following-the-laws-of-probability">10.3.1. Why should degrees of belief follow the laws of probability?</a></h4>
<p>One question that might be raised against the subjective account of probability is why, on this account, our degrees of belief should satisfy the Kolmogorov axioms. For example, why should our subjective degrees of belief in mutually exclusive, exhaustive events add to 1? One answer to this question shows that agents whose degrees of belief don’t satisfy these axioms will be subject to Dutch Book bets. These are bets where the agent will inevitably lose money. <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 7)</a> explains:</p>
<blockquote>
<p>Suppose, for instance, that you believe to degree 0.55 that at least one person from India will win a gold medal in the next Olympic Games (event G), and that your subjective degree of belief is 0.52 that no Indian will win a gold medal in the next Olympic Games (event ¬G). Also suppose that a cunning bookie offers you a bet on both of these events. The bookie promises to pay you $1 for each event that actually takes place. Now, since your subjective degree of belief that G will occur is 0.55 it would be rational to pay up to $1·0.55 = $0.55 for entering this bet. Furthermore, since your degree of belief in ¬G is 0.52 you should be willing to pay up to $0.52 for entering the second bet, since $1·0.52 = $0.52. However, by now you have paid $1.07 for taking on two bets that are certain to give you a payoff of $1 <em>no matter what happens</em>...Certainly, this must be irrational. Furthermore, the reason why this is irrational is that your subjective degrees of belief violate the probability calculus.</p>
</blockquote>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/9xoLg.jpg" alt="A Dutch Book argument" loading="lazy"></div>
<p class="caption">A Dutch Book argument</p>
</div>
<p>It can be proven that an agent is subject to Dutch Book bets if, and only if, their degrees of belief violate the axioms of probability. This provides an argument for why degrees of beliefs should satisfy these axioms.</p>
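<p>Peterson’s numbers make the guaranteed loss concrete. A minimal sketch:</p>

```python
# Peterson's example: degree of belief 0.55 in G and 0.52 in ¬G.
belief_g, belief_not_g = 0.55, 0.52
assert belief_g + belief_not_g > 1  # violates the probability axioms

# Each bet pays $1 if its event occurs, and exactly one of G, ¬G occurs.
total_paid = 1.00 * belief_g + 1.00 * belief_not_g  # willing to pay $1.07
guaranteed_payout = 1.00
sure_loss = total_paid - guaranteed_payout
print(f"sure loss: ${sure_loss:.2f}")  # → sure loss: $0.07
```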
<h4 id="measuring-subjective-probabilities"><a href="#measuring-subjective-probabilities">10.3.2. Measuring subjective probabilities</a></h4>
<p>Another challenge raised by the subjective view is how probabilities can be measured. If they represent subjective degrees of belief, there doesn’t seem to be an easy way to determine them from observations of the world. However, a number of responses to this problem have been advanced, one of which is explained succinctly by <a href="http://www.amazon.com/Introduction-Decision-Cambridge-Introductions-Philosophy/dp/0521888379/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Peterson (2009, ch. 7)</a>:</p>
<blockquote>
<p>The main innovations presented by… Savage can be characterised as systematic procedures for linking probability… to claims about objectively observable behavior, such as preference revealed in choice behavior. Imagine, for instance, that we wish to measure Caroline’s subjective probability that the coin she is holding in her hand will land heads up the next time it is tossed. First, we ask her which of the following very generous options she would prefer.</p>
</blockquote>
<blockquote>
<p>A: “If the coin lands heads up you win a sports car; otherwise you win nothing.”</p>
</blockquote>
<blockquote>
<p>B: “If the coin <em>does not</em> land heads up you win a sports car; otherwise you win nothing.”</p>
</blockquote>
<blockquote>
<p>Suppose Caroline prefers A to B. We can then safely conclude that she thinks it is <em>more probable</em> that the coin will land heads up rather than not. This follows from the assumption that Caroline prefers to win a sports car rather than nothing, and that her preference between uncertain prospects is entirely determined by her beliefs and desires with respect to her prospects of winning the sports car...</p>
</blockquote>
<blockquote>
<p>Next, we need to generalise the measurement procedure outlined above such that it allows us to always represent Caroline’s degrees of belief with precise numerical probabilities. To do this, we need to ask Caroline to state preferences over a <em>much larger</em> set of options and then <em>reason backwards</em>… Suppose, for instance, that Caroline wishes to measure her subjective probability that her car worth $20,000 will be stolen within one year. If she considers $1,000 to be… the highest price she is prepared to pay for a gamble in which she gets $20,000 if the event S: “The car is stolen within a year” takes place, and nothing otherwise, then Caroline’s subjective probability for S is <span class="frac"><sup>1,000</sup>⁄<sub>20,000</sub></span> = 0.05, given that she forms her preferences in accordance with the principle of maximising expected monetary value...</p>
</blockquote>
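<p>The backwards reasoning in the car example amounts to a one-line calculation, valid only under the expected-monetary-value assumption that Peterson goes on to question:</p>

```python
def implied_probability(max_price, prize):
    """Subjective probability implied by the highest price an agent will
    pay for a bet paying `prize` if the event occurs and nothing
    otherwise, assuming she maximises expected monetary value."""
    return max_price / prize

# Caroline's car example: $1,000 for a $20,000 payout.
assert implied_probability(1_000, 20_000) == 0.05
```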
<blockquote>
<p>The problem with this method is that very few people form their preferences in accordance with the principle of maximising expected monetary value. Most people have a decreasing marginal utility for money...</p>
</blockquote>
<blockquote>
<p>Fortunately, there is a clever solution to [this problem]. The basic idea is to impose a number of structural conditions on preferences over uncertain options [e.g. the transitivity axiom]. Then, the subjective probability function is established by reasoning backwards while taking the structural axioms into account: Since the decision maker preferred some uncertain options to others, and her preferences… satisfy a number of structure axioms, the decision maker behaves <em>as if</em> she were forming her preferences over uncertain options by first assigning subjective probabilities and utilities to each option and thereafter maximising expected utility.</p>
</blockquote>
<blockquote>
<p>A peculiar feature of this approach is, thus, that probabilities (and utilities) are derived from ‘within’ the theory. The decision maker does not prefer an uncertain option to another <em>because</em> she judges the subjective probabilities (and utilities) of the outcomes to be more favourable than those of another. Instead, the… structure of the decision maker’s preferences over uncertain options logically implies that they can be described <em>as if</em> her choices were governed by a subjective probability function and a utility function...</p>
</blockquote>
<blockquote>
<p>...Savage’s approach [seeks] to explicate subjective interpretations of the probability axioms by making certain claims about preferences over… uncertain options. But… why on earth should a theory of subjective probability involve assumptions about preferences, given that preferences and beliefs are separate entities? Contrary to what is claimed by [Savage and others], emotionally inert decision makers failing to muster any preferences at all… could certainly hold partial beliefs.</p>
</blockquote>
<p>Other theorists, for example <a href="http://www.amazon.com/Optimal-Statistical-Decisions-Classics-Library/dp/047168029X/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">DeGroot (1970)</a>, propose other approaches:</p>
<blockquote>
<p>DeGroot’s basic assumption is that decision makers can make <em>qualitative</em> comparisons between pairs of events, and judge which one they think is most likely to occur. For example, he assumes that one can judge whether it is <em>more</em>, <em>less</em>, or <em>equally</em> likely, according to one’s own beliefs, that it will rain today in Cambridge than in Cairo. DeGroot then shows that if the agent’s qualitative judgments are sufficiently fine-grained and satisfy a number of structural axioms, then [they can be described by a probability distribution]. So in DeGroot’s… theory, the probability function is obtained by fine-tuning qualitative data, thereby making them quantitative.</p>
</blockquote>
<h2 id="what-about-newcombs-problem-and-alternative-decision-algorithms"><a href="#what-about-newcombs-problem-and-alternative-decision-algorithms">11. What about “Newcomb’s problem” and alternative decision algorithms?</a></h2>
<p>Saying that a rational agent “maximizes expected utility” is, unfortunately, not specific enough. There are a variety of decision algorithms which aim to maximize expected utility, and they give <em>different answers</em> to some decision problems, for example “Newcomb’s problem.”</p>
<p>In this section, we explain these decision algorithms and show how they perform on Newcomb’s problem and related “Newcomblike” problems.</p>
<p>General sources on this topic include: <a href="http://www.amazon.com/Paradoxes-Rationality-Cooperation-Prisoners-Newcombs/dp/0774802154/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Campbell & Sowden (1985)</a>, <a href="http://kops.ub.uni-konstanz.de/bitstream/handle/urn:nbn:de:bsz:352-opus-5241/ledwig.pdf?sequence=1">Ledwig (2000)</a>, <a href="http://www.amazon.com/Foundations-Decision-Cambridge-Probability-Induction/dp/0521063566/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Joyce (1999)</a>, and <a href="http://intelligence.org/files/TDT.pdf">Yudkowsky (2010)</a>. <a href="http://www.operalgo.com/PDF/Moertelmaier_Newcomblike_2013.pdf">Moertelmaier (2013)</a> discusses Newcomblike problems in the context of the agent-environment framework.</p>
<h3 id="newcomblike-problems-and-two-decision-algorithms"><a href="#newcomblike-problems-and-two-decision-algorithms">11.1. Newcomblike problems and two decision algorithms</a></h3>
<p>I’ll begin with an exposition of several Newcomblike problems, so that I can refer to them in later sections. I’ll also introduce our first two decision algorithms, so that I can show how one’s choice of decision algorithm affects an agent’s outcomes on these problems.</p>
<h4 id="newcombs-problem"><a href="#newcombs-problem">11.1.1. Newcomb’s Problem</a></h4>
<p>Newcomb’s problem was formulated by the physicist <a href="http://en.wikipedia.org/wiki/William_Newcomb">William Newcomb</a> but first published in <a href="http://faculty.arts.ubc.ca/rjohns/nozick_newcomb.pdf">Nozick (1969)</a>. Below I present a version of it inspired by <a href="http://intelligence.org/files/TDT.pdf">Yudkowsky (2010)</a>.</p>
<p>A superintelligent machine named Omega visits Earth from another galaxy and shows itself to be very good at predicting events. This isn’t because it has magical powers, but because it knows more science than we do, has billions of sensors scattered around the globe, and runs efficient algorithms for modeling humans and other complex systems with unprecedented precision — on an array of computer hardware the size of our moon.</p>
<p>Omega presents you with two boxes. Box A is transparent and contains $1000. Box B is opaque and contains either $1 million or nothing. You may choose to take both boxes (called “two-boxing”), or you may choose to take only box B (called “one-boxing”). If Omega predicted you’ll two-box, then Omega has left box B empty. If Omega predicted you’ll one-box, then Omega has placed $1M in box B.</p>
<p>By the time you choose, Omega has already left for its next game — the contents of box B won’t change after you make your decision. Moreover, you’ve watched Omega play a thousand games against people like you, and on every occasion Omega predicted the human player’s choice accurately.</p>
<p>Should you one-box or two-box?</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/4MFhs.jpg" alt="Newcomb’s problem" loading="lazy"></div>
<p class="caption">Newcomb’s problem</p>
</div>
<p>Here’s an argument for two-boxing. The $1M either <em>is</em> or <em>is not</em> in the box; your choice cannot affect the contents of box B now. So, you should two-box, because then you get $1K plus whatever is in box B. This is a straightforward application of the dominance principle (section 6.1). Two-boxing dominates one-boxing.</p>
<p>Convinced? Well, here’s an argument for one-boxing. On all those earlier games you watched, everyone who two-boxed received $1K, and everyone who one-boxed received $1M. So you’re almost certain that you’ll get $1K for two-boxing and $1M for one-boxing, which means that to maximize your expected utility, you should one-box.</p>
<p><a href="http://faculty.arts.ubc.ca/rjohns/nozick_newcomb.pdf">Nozick (1969)</a> reports:</p>
<blockquote>
<p>I have put this problem to a large number of people… To almost everyone it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly.</p>
</blockquote>
<p>This is not a “merely verbal” dispute (<a href="http://philreview.dukejournals.org/content/120/4/515.short">Chalmers 2011</a>). Decision theorists have offered different <em>algorithms</em> for making a choice, and they have different outcomes. Translated into English, the first algorithm (<em>evidential decision theory</em> or EDT) says “Take actions such that you would be glad to receive the news that you had taken them.” The second algorithm (<em>causal decision theory</em> or CDT) says “Take actions which you expect to have a positive effect on the world.”</p>
<p>Many decision theorists have the intuition that CDT is right. But a CDT agent appears to “lose” on Newcomb’s problem, ending up with $1000, while an EDT agent gains $1M. Proponents of EDT can ask proponents of CDT: “If you’re so smart, why aren’t you rich?” As <a href="http://www-ihpst.univ-paris1.fr/fichiers/programmes/20/Spohn-One-Boxing3.pdf">Spohn (2012)</a> writes, “this must be poor rationality that complains about the reward for irrationality.” Or as <a href="http://intelligence.org/files/TDT.pdf">Yudkowsky (2010)</a> argues:</p>
<blockquote>
<p>An expected utility maximizer should maximize <em>utility</em> — not formality, reasonableness, or defensibility...</p>
</blockquote>
<p>In response to EDT’s apparent “win” over CDT on Newcomb’s problem, proponents of CDT have presented similar problems on which a CDT agent “wins” and an EDT agent “loses.” Proponents of EDT, meanwhile, have replied with additional Newcomblike problems on which EDT wins and CDT loses. Let’s explore each of them in turn.</p>
<h4 id="evidential-and-causal-decision-theory"><a href="#evidential-and-causal-decision-theory">11.1.2. Evidential and causal decision theory</a></h4>
<p>First, however, we will consider our two decision algorithms in a little more detail.</p>
<p>EDT can be described simply: according to this theory, agents should use conditional probabilities when determining the expected utility of different acts. Specifically, they should use the probability of the world being in each possible state conditional on their carrying out the act under consideration. So in Newcomb’s problem they consider the probability that Box B contains $1 million or nothing conditional on the evidence provided by their decision to one-box or two-box. This is how the theory formalizes the notion of an act providing good news.</p>
<p>CDT is more complex, at least in part because it has been formulated in a variety of different ways and these formulations are equivalent to one another only if certain background assumptions are met. However, a good sense of the theory can be gained by considering the counterfactual approach, which is one of the more intuitive of these formulations. This approach utilizes the probabilities of certain counterfactual conditionals, which can be thought of as representing the causal influence of an agent’s acts on the state of the world. These conditionals take the form “if I were to carry out a certain act, then the world would be in a certain state.” So in Newcomb’s problem, for example, this formulation of CDT considers the probability of the counterfactuals like “if I were to one-box, then Box B would contain $1 million” and, in doing so, considers the causal influence of one-boxing on the contents of the boxes.</p>
<p>The same distinction can be made in formulaic terms. Both EDT and CDT agree that decision theory should be about maximizing expected utility where the expected utility of an act, A, given a set of possible outcomes, O, is defined as follows:</p>
<p><div class="imgonly"><img src="http://i.imgur.com/CSwK4.gif" alt="expected utility formula" loading="lazy"></div>.</p>
<p>In this equation, V(A & O) represents the value to the agent of the combination of an act and an outcome. So this is the utility that the agent will receive if they carry out a certain act and a certain outcome occurs. Further, Pr<sub>A</sub>O represents the probability of each outcome occurring on the supposition that the agent carries out a certain act. It is in terms of this probability that CDT and EDT differ. EDT uses the conditional probability, Pr(O|A), while CDT uses the probability of subjunctive conditionals, Pr(A <div class="imgonly"><img src="http://i.imgur.com/G8xec.gif" alt="" loading="lazy"></div> O).</p>
<p>Using these two versions of the expected utility formula, it’s possible to demonstrate in a formal manner why EDT and CDT give the advice they do in Newcomb’s problem. To demonstrate this it will help to make two simplifying assumptions. First, we will presume that each dollar of money is worth 1 unit of utility to the agent (and so will presume that the agent’s utility is linear with money). Second, we will presume that Omega is a perfect predictor of human actions so that if the agent two-boxes it provides definitive evidence that there is nothing in the opaque box and if the agent one-boxes it provides definitive evidence that there is $1 million in this box. Given these assumptions, EDT calculates the expected utility of each decision as follows:</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/aCe4Y.gif" alt="EU for two-boxing according to EDT" loading="lazy"></div>
<p class="caption">EU for two-boxing according to EDT</p>
</div>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/vJtVr.gif" alt="EU for one-boxing according to EDT" loading="lazy"></div>
<p class="caption">EU for one-boxing according to EDT</p>
</div>
<p>Given that one-boxing has a higher expected utility according to these calculations, an EDT agent will one-box.</p>
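To make these calculations concrete, here is a minimal Python sketch of EDT’s formula for Newcomb’s problem. The payoff table and function name are illustrative; the sketch assumes, as above, a perfect predictor and utility linear in dollars.

```python
# V(A & O): dollars received for each act/contents pair (utility linear in money)
payoff = {
    ("one-box", "million"): 1_000_000,
    ("one-box", "empty"): 0,
    ("two-box", "million"): 1_001_000,
    ("two-box", "empty"): 1_000,
}

def edt_eu(act):
    # With a perfect predictor, the act is definitive evidence about
    # Box B's contents: Pr(million | one-box) = 1, Pr(million | two-box) = 0.
    pr_million = 1.0 if act == "one-box" else 0.0
    return (pr_million * payoff[(act, "million")]
            + (1 - pr_million) * payoff[(act, "empty")])

print(edt_eu("one-box"))   # 1000000.0
print(edt_eu("two-box"))   # 1000.0
```

Since one-boxing has the higher conditional expected utility, the EDT agent one-boxes, exactly as the figures above show.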
<p>On the other hand, given that the agent’s decision doesn’t causally influence Omega’s earlier prediction, CDT will use the same probability regardless of whether the agent one-boxes or two-boxes. The decision endorsed will be the same regardless of what probability we use so, to demonstrate the theory, we can simply assign an arbitrary 0.5 probability that the opaque box has nothing in it and a 0.5 probability that it has $1 million in it. CDT then calculates the expected utility of each decision as follows:</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/oyHGl.gif" alt="EU for two-boxing according to CDT" loading="lazy"></div>
<p class="caption">EU for two-boxing according to CDT</p>
</div>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/7uX9t.gif" alt="EU for one-boxing according to CDT" loading="lazy"></div>
<p class="caption">EU for one-boxing according to CDT</p>
</div>
<p>Given that two-boxing has a higher expected utility according to these calculations, a CDT agent will two-box. This approach demonstrates the result given more informally in the previous section: CDT agents will two-box in Newcomb’s problem and EDT agents will one-box.</p>
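The corresponding CDT calculation can be sketched the same way. The 0.5 credence is the arbitrary value used above; as the text notes, the verdict does not depend on it, since two-boxing gains $1000 in either state.

```python
# Same illustrative payoffs as the text; utility linear in dollars.
payoff = {
    ("one-box", "million"): 1_000_000,
    ("one-box", "empty"): 0,
    ("two-box", "million"): 1_001_000,
    ("two-box", "empty"): 1_000,
}

def cdt_eu(act, pr_million=0.5):
    # For CDT the probability of each state is held fixed across acts,
    # since the choice has no causal influence on Omega's prediction.
    return (pr_million * payoff[(act, "million")]
            + (1 - pr_million) * payoff[(act, "empty")])

print(cdt_eu("two-box") - cdt_eu("one-box"))  # 1000.0
```

Changing `pr_million` shifts both expected utilities equally, so two-boxing comes out ahead by $1000 for any credence.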
<p>As mentioned before, there are also alternative formulations of CDT. For example, David Lewis <a href="http://www.tandfonline.com/doi/abs/10.1080/00048408112340011">(1981)</a> and Brian Skyrms <a href="http://www.amazon.com/Causal-Necessity-Pragmatic-Investigation-Laws/dp/0300023391/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">(1980)</a> both present approaches that rely on the partition of the world into states to capture causal information, rather than counterfactual conditionals. On Lewis’s version of this account, for example, the agent calculates the expected utility of acts using their unconditional credence in states of the world that are <em>dependency hypotheses</em>: descriptions of the possible ways that the world can depend on the agent’s actions. These dependency hypotheses intrinsically contain the required causal information.</p>
<p>Other traditional approaches to CDT include the imaging approach of <a href="http://commonsenseatheism.com/wp-content/uploads/2012/09/Sobel-Probability-Chance-and-Choice-a-Theory-of-Rational-Agency.pdf">Sobel (1980)</a> (also see <a href="http://www.tandfonline.com/doi/abs/10.1080/00048408112340011">Lewis 1981</a>) and the unconditional expectations approach of Leonard Savage <a href="http://www.amazon.com/Foundations-Statistics-Leonard-J-Savage/dp/0486623491/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">(1954)</a>. Those interested in the various traditional approaches to CDT would be best to consult Lewis <a href="http://www.tandfonline.com/doi/abs/10.1080/00048408112340011">(1981)</a>, <a href="http://plato.stanford.edu/entries/decision-causal/">Weirich (2008)</a>, and <a href="http://www.amazon.com/Foundations-Decision-Cambridge-Probability-Induction/dp/0521063566/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Joyce (1999)</a>. More recently, work in computer science on a tool called causal Bayesian networks has led to an innovative approach to CDT that has received some recent attention in the philosophical literature (<a href="http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/0521773628/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Pearl 2000, ch. 4</a> and <a href="http://www-ihpst.univ-paris1.fr/fichiers/programmes/20/Spohn-One-Boxing3.pdf">Spohn 2012</a>).</p>
<p>Now we return to an analysis of decision scenarios, armed with EDT and the counterfactual formulation of CDT.</p>
<h4 id="medical-newcomb-problems"><a href="#medical-newcomb-problems">11.1.3. Medical Newcomb problems</a></h4>
<p>Medical Newcomb problems share a similar form but come in many variants, including Solomon’s problem (<a href="https://www.kellogg.northwestern.edu/research/math/papers/194.pdf">Gibbard & Harper 1976</a>) and the smoking lesion problem (<a href="http://fitelson.org/few/few_05/egan.pdf">Egan 2007</a>). Below I present a variant called the “chewing gum problem” (<a href="http://intelligence.org/files/TDT.pdf">Yudkowsky 2010</a>):</p>
<blockquote>
<p>Suppose that a recently published medical study shows that chewing gum seems to cause throat abscesses — an outcome-tracking study showed that of people who chew gum, 90% died of throat abscesses before the age of 50. Meanwhile, of people who do not chew gum, only 10% die of throat abscesses before the age of 50. The researchers, to explain their results, wonder if saliva sliding down the throat wears away cellular defenses against bacteria. Having read this study, would you choose to chew gum? But now a second study comes out, which shows that most gum-chewers have a certain gene, CGTA, and the researchers produce a table showing the following mortality rates:</p>
</blockquote>
<blockquote>
<table border="0">
<tbody>
<tr>
<td class="numeric"> </td>
<td>CGTA present</td>
<td>CGTA absent</td>
</tr>
<tr>
<td>Chew Gum</td>
<td>89% die</td>
<td>8% die</td>
</tr>
<tr>
<td>Don’t chew</td>
<td>99% die</td>
<td>11% die</td>
</tr>
</tbody>
</table>
</blockquote>
<blockquote>
<p>This table shows that whether you have the gene CGTA or not, your chance of dying of a throat abscess goes down if you chew gum. Why are fatalities so much higher for gum-chewers, then? Because people with the gene CGTA tend to chew gum and die of throat abscesses. The authors of the second study also present a test-tube experiment which shows that the saliva from chewing gum can kill the bacteria that form throat abscesses. The researchers hypothesize that because people with the gene CGTA are highly susceptible to throat abscesses, natural selection has produced in them a tendency to chew gum, which protects against throat abscesses. The strong correlation between chewing gum and throat abscesses is not because chewing gum causes throat abscesses, but because a third factor, CGTA, leads to chewing gum and throat abscesses.</p>
</blockquote>
<blockquote>
<p>Having learned of this new study, would you choose to chew gum? Chewing gum helps protect against throat abscesses whether or not you have the gene CGTA. Yet a friend who heard that you had decided to chew gum (as people with the gene CGTA often do) would be quite alarmed to hear the news — just as she would be saddened by the news that you had chosen to take both boxes in Newcomb’s Problem. This is a case where [EDT] seems to return the wrong answer, calling into question the validity of the… rule “Take actions such that you would be glad to receive the news that you had taken them.” Although the news that someone has decided to chew gum is alarming, medical studies nonetheless show that chewing gum protects against throat abscesses. [CDT’s] rule of “Take actions which you expect to have a positive physical effect on the world” seems to serve us better.</p>
</blockquote>
<p>One response to this claim, called the <em>tickle defense</em> (<a href="http://www.jstor.org/discover/10.2307/20115662?uid=3737536&uid=2129&uid=2&uid=70&uid=4&sid=21101205363271">Eells, 1981</a>), argues that EDT actually reaches the right decision in such cases. According to this defense, the most reasonable way to construe the “chewing gum problem” involves presuming that CGTA causes a desire (a mental “tickle”) which then causes the agent to be more likely to chew gum, rather than CGTA directly causing the action. Given this, if we presume that the agent already knows their own desires and hence already knows whether they’re likely to have the CGTA gene, chewing gum will not provide the agent with further bad news. Consequently, an agent following EDT will chew in order to get the good news that they have decreased their chance of getting abscesses.</p>
<p>Unfortunately, the tickle defense fails to achieve its aims. In introducing this approach, Eells hoped that EDT could be made to mimic CDT but without an allegedly inelegant reliance on causation. However, <a href="http://www.amazon.com/Taking-Chances-Cambridge-Probability-Induction/dp/0521038987/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Sobel (1994, ch. 2)</a> demonstrated that the tickle defense failed to ensure that EDT and CDT would decide equivalently in all cases. On the other hand, those who feel that EDT originally got it right by one-boxing in Newcomb’s problem will be disappointed to discover that the tickle defense leads an agent to two-box in some versions of Newcomb’s problem and so solves one problem for the theory at the expense of introducing another.</p>
<p>So just as CDT “loses” on Newcomb’s problem, EDT will “lose” on Medical Newcomb problems (if the tickle defense fails) or will join CDT and “lose” on Newcomb’s Problem itself (if the tickle defense succeeds).</p>
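The structure of the chewing gum problem can be checked numerically. In the sketch below, the death rates come from the table above, but the gene prevalence (50%) and the chewing rates among carriers and non-carriers (90% and 10%) are illustrative assumptions of mine, not figures from the hypothetical studies.

```python
# Death rates from the table in the text.
death = {("chew", "CGTA"): 0.89, ("chew", "no-CGTA"): 0.08,
         ("dont", "CGTA"): 0.99, ("dont", "no-CGTA"): 0.11}
# Illustrative population assumptions (not from the hypothetical studies):
pr_gene = {"CGTA": 0.5, "no-CGTA": 0.5}
pr_chew = {"CGTA": 0.9, "no-CGTA": 0.1}   # Pr(chew | gene status)

def pr_death_given(act):
    # Bayes: weight each gene status by Pr(gene | act), then mix death rates.
    joint = {g: pr_gene[g] * (pr_chew[g] if act == "chew" else 1 - pr_chew[g])
             for g in pr_gene}
    total = sum(joint.values())
    return sum(joint[g] / total * death[(act, g)] for g in joint)

# Causal/dominance reasoning: chewing lowers the death rate in BOTH columns.
assert all(death[("chew", g)] < death[("dont", g)] for g in pr_gene)

# Evidential reasoning: chewing is still bad news in the population at large.
print(pr_death_given("chew"), pr_death_given("dont"))  # ≈ 0.809 vs ≈ 0.198
```

The same table thus makes chewing causally dominant while leaving it strongly correlated with death, which is precisely the wedge the scenario drives between CDT and (tickle-defense-free) EDT.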
<h4 id="newcombs-soda"><a href="#newcombs-soda">11.1.4. Newcomb’s soda</a></h4>
<p>There are also similar problematic cases for EDT where the evidence provided by your decision relates not to a feature that you were born (or created) with but to some other feature of the world. One such scenario is the <em>Newcomb’s soda</em> problem, introduced in <a href="http://intelligence.org/files/TDT.pdf">Yudkowsky (2010)</a>:</p>
<blockquote>
<p>You know that you will shortly be administered one of two sodas in a double-blind clinical test. After drinking your assigned soda, you will enter a room in which you find a chocolate ice cream and a vanilla ice cream. The first soda produces a strong but entirely subconscious desire for chocolate ice cream, and the second soda produces a strong subconscious desire for vanilla ice cream. By “subconscious” I mean that you have no introspective access to the change, any more than you can answer questions about individual neurons firing in your cerebral cortex. You can only infer your changed tastes by observing which kind of ice cream you pick.</p>
</blockquote>
<blockquote>
<p>It so happens that all participants in the study who test the Chocolate Soda are rewarded with a million dollars after the study is over, while participants in the study who test the Vanilla Soda receive nothing. But subjects who actually eat vanilla ice cream receive an additional thousand dollars, while subjects who actually eat chocolate ice cream receive no additional payment. You can choose one and only one ice cream to eat. A pseudo-random algorithm assigns sodas to experimental subjects, who are evenly divided (50/50) between Chocolate and Vanilla Sodas. You are told that 90% of previous research subjects who chose chocolate ice cream did in fact drink the Chocolate Soda, while 90% of previous research subjects who chose vanilla ice cream did in fact drink the Vanilla Soda. Which ice cream would you eat?</p>
</blockquote>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/FAZnb.jpg" alt="Newcomb’s soda" loading="lazy"></div>
<p class="caption">Newcomb’s soda</p>
</div>
<p>In this case, an EDT agent will decide to eat chocolate ice cream as this would provide evidence that they drank the chocolate soda and hence that they will receive $1 million after the experiment. However, this seems to be the wrong decision and so, once again, the EDT agent “loses”.</p>
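A quick sketch contrasts the two calculations, assuming utility linear in dollars and treating the stated 90% figure as the agent’s conditional credence; the function names are illustrative.

```python
def edt_eu(ice_cream):
    # Your choice is evidence about which soda you drank.
    pr_choc_soda = 0.9 if ice_cream == "chocolate" else 0.1
    bonus = 1_000 if ice_cream == "vanilla" else 0
    return pr_choc_soda * 1_000_000 + bonus

def cdt_eu(ice_cream):
    # The soda was assigned 50/50 before the choice; eating ice cream
    # cannot causally change that assignment.
    bonus = 1_000 if ice_cream == "vanilla" else 0
    return 0.5 * 1_000_000 + bonus

assert edt_eu("chocolate") > edt_eu("vanilla")   # EDT eats chocolate
assert cdt_eu("vanilla") > cdt_eu("chocolate")   # CDT takes the sure $1000
```

The evidential calculation favors chocolate by a wide margin even though the choice of ice cream has no effect on the million-dollar payment.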
<h4 id="bostroms-meta-newcomb-problem"><a href="#bostroms-meta-newcomb-problem">11.1.5. Bostrom’s meta-Newcomb problem</a></h4>
<p>In response to attacks on their theory, the proponent of EDT can present alternative scenarios where EDT “wins” and it is CDT that “loses”. One such case is the <em>meta-Newcomb problem</em> proposed in <a href="http://www.nickbostrom.com/papers/newcomb.html">Bostrom (2001)</a>. Adapted to fit my earlier story about Omega the superintelligent machine (section 11.1.1), the problem runs like this: Either Omega has <em>already</em> placed $1M or nothing in box B (depending on its prediction about your choice), or else Omega is watching as you choose and <em>after</em> your choice it will place $1M into box B only if you have one-boxed. But you don’t know which is the case. Omega makes its move before the human player’s choice about half the time, and the rest of the time it makes its move <em>after</em> the player’s choice.</p>
<p>But now suppose there is another superintelligent machine, Meta-Omega, who has a perfect track record of predicting both Omega’s choices and the choices of human players. Meta-Omega tells you that either you will two-box and Omega will “make its move” <em>after</em> you make your choice, or else you will one-box and Omega has <em>already</em> made its move (and gone on to the next game, with someone else).</p>
<p>Here, an EDT agent one-boxes and walks away with a million dollars. On the face of it, however, a CDT agent faces a dilemma: if she two-boxes then Omega’s action depends on her choice, so the “rational” choice is to one-box. But if the CDT agent one-boxes, then Omega’s action temporally precedes (and is thus physically independent of) her choice, so the “rational” action is to two-box. It might seem, then, that a CDT agent will be unable to reach any decision in this scenario. However, further reflection reveals that the issue is more complicated. According to CDT, what the agent ought to do in this scenario depends on their credences about their own actions. If they have a high credence that they will two-box, they ought to one-box and if they have a high credence that they will one-box, they ought to two box. Given that the agent’s credences in their actions are not given to us in the description of the meta-Newcomb problem, the scenario is underspecified and it is hard to know what conclusions should be drawn from it.</p>
<h4 id="the-psychopath-button"><a href="#the-psychopath-button">11.1.6. The psychopath button</a></h4>
<p>Fortunately, another case has been introduced where, according to CDT, what an agent ought to do depends on their credences about what they will do. This is the <em>psychopath button</em>, introduced in <a href="http://philreview.dukejournals.org/content/116/1/93.citation">Egan (2007)</a>:</p>
<blockquote>
<p>Paul is debating whether to press the “kill all psychopaths” button. It would, he thinks, be much better to live in a world with no psychopaths. Unfortunately, Paul is quite confident that only a psychopath would press such a button. Paul very strongly prefers living in a world with psychopaths to dying. Should Paul press the button?</p>
</blockquote>
<p>Many people think Paul should not. After all, if he does so, he is almost certainly a psychopath and so pressing the button will almost certainly cause his death. This is also the response that an EDT agent will give. After all, pushing the button would provide the agent with the bad news that they are almost certainly a psychopath and so will die as a result of their action.</p>
<p>On the other hand, if Paul is fairly certain that he is not a psychopath, then CDT will say that he ought to press the button. CDT will note that, given Paul’s confidence that he isn’t a psychopath, his decision will almost certainly have a positive impact as it will result in the death of all psychopaths and Paul’s survival. On the face of it, then, a CDT agent would decide inappropriately in this case by pushing the button. Importantly, unlike in the meta-Newcomb problem, the agent’s credences about their own behavior are specified in Egan’s full version of this scenario (in non-numeric terms, the agent thinks they’re unlikely to be a psychopath and hence unlikely to press the button).</p>
<p>However, in order to produce this problem for CDT, Egan made a number of assumptions about how an agent should decide when what they ought to do depends on what they think they will do. In response, alternative views about deciding in such cases have been advanced (particularly in <a href="http://www.jstor.org/discover/10.2307/40267481?uid=3737536&uid=2&uid=4&sid=21101299066461">Arntzenius, 2008</a> and <a href="http://rd.springer.com/article/10.1007/s11229-011-0022-6">Joyce, 2012</a>). Given these factors, opinions are split about whether the psychopath button problem does in fact pose a challenge to CDT.</p>
<h4 id="parfits-hitchhiker"><a href="#parfits-hitchhiker">11.1.7. Parfit’s hitchhiker</a></h4>
<p>Not all decision scenarios are problematic for just one of EDT or CDT. There are also cases where an EDT agent and a CDT agent will both “lose”. One such case is <em>Parfit’s Hitchhiker</em> (<a href="http://www.amazon.com/Reasons-Persons-Oxford-Paperbacks-Parfit/dp/019824908X/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Parfit, 1984, p. 7</a>):</p>
<blockquote>
<p>Suppose that I am driving at midnight through some desert. My car breaks down. You are a stranger, and the only other driver near. I manage to stop you, and I offer you a great reward if you rescue me. I cannot reward you now, but I promise to do so when we reach my home. Suppose next that I am <em>transparent</em>, unable to deceive others. I cannot lie convincingly. Either a blush, or my tone of voice, always gives me away. Suppose, finally, that I know myself to be never self-denying. If you drive me to my home, it would be worse for me if I gave you the promised reward. Since I know that I never do what will be worse for me, I know that I shall break my promise. Given my inability to lie convincingly, you know this too. You do not believe my promise, and therefore leave me stranded in the desert.</p>
</blockquote>
<p>In this scenario the agent “loses” if they would later refuse to give the stranger the reward. However, both EDT agents and CDT agents will refuse to do so. After all, by this point the agent will already be safe so giving the reward can neither provide good news about, nor cause, their safety. So this seems to be a case where both theories “lose”.</p>
<h4 id="transparent-newcombs-problem"><a href="#transparent-newcombs-problem">11.1.8. Transparent Newcomb’s problem</a></h4>
<p>There are also other cases where both EDT and CDT “lose”. One of these is the <em>Transparent Newcomb’s problem</em> which, in at least one version, is due to <a href="http://www.amazon.com/Good-Real-Demystifying-Paradoxes-Bradford/dp/0262042339/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Drescher (2006, p. 238-242)</a>. This scenario is like the original Newcomb’s problem but, in this case, both boxes are transparent so you can see their contents when you make your decision. Again, Omega has filled box A with $1000 and Box B with either $1 million or nothing based on a prediction of your behavior. Specifically, Omega has predicted how you would decide if you witnessed $1 million in Box B. If Omega predicted that you would one-box in this case, he placed $1 million in Box B. On the other hand, if Omega predicted that you would two-box in this case then he placed nothing in Box B.</p>
<p>Both EDT and CDT agents will two-box in this case. After all, the contents of the boxes are determined and known so the agent’s decision can neither provide good news about what they contain nor cause them to contain something desirable. As with two-boxing in the original version of Newcomb’s problem, many philosophers will endorse this behavior.</p>
<p>However, it’s worth noting that Omega will almost certainly have predicted this decision and so filled Box B with nothing. CDT and EDT agents will end up with $1000. On the other hand, just as in the original case, the agent that one-boxes will end up with $1 million. So this is another case where both EDT and CDT “lose”. Consequently, to those that agree with the earlier comments (in section 11.1.1) that a decision theory shouldn’t lead an agent to “lose”, neither of these theories will be satisfactory.</p>
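The comparison here is between dispositions rather than isolated choices, and it can be made explicit in a short sketch, assuming a perfect predictor; the function name is illustrative.

```python
def payout(disposition_on_seeing_million):
    # Omega fills Box B according to what you WOULD do upon seeing
    # $1 million in it ("one-box" or "two-box").
    if disposition_on_seeing_million == "one-box":
        return 1_000_000   # Omega predicted one-boxing and filled Box B
    return 1_000           # Box B is empty; the agent takes both boxes

# EDT and CDT agents both embody the "two-box" disposition in this scenario.
print(payout("two-box"))   # 1000
print(payout("one-box"))   # 1000000
```

Since both theories instantiate the $1000 disposition, the sketch just restates the text’s point: the agent disposed to one-box walks away a thousand times richer.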
<h4 id="counterfactual-mugging"><a href="#counterfactual-mugging">11.1.9. Counterfactual mugging</a></h4>
<p>Another similar case, known as <em>counterfactual mugging</em>, was developed in <a href="https://www.greaterwrong.com/posts/mg6jDEuQEjBGtibX7/counterfactual-mugging">Nesov (2009)</a>:</p>
<blockquote>
<p>Imagine that one day, Omega comes to you and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don’t want to give up your $100. But see, the Omega tells you that if the coin came up heads instead of tails, it’d give you $10000, but only if you’d agree to give it $100 if the coin came up tails.</p>
</blockquote>
<p>Should you give up the $100?</p>
<p>Both CDT and EDT say no. After all, giving up your money neither provides good news about nor influences your chances of getting $10,000 out of the exchange. Further, this intuitively seems like the right decision. On the face of it, then, it is appropriate to retain your money in this case.</p>
<p>However, presuming you take Omega to be perfectly trustworthy, there seems to be room to debate this conclusion. If you are the sort of agent that gives up the $100 in counterfactual mugging then you will tend to do better than the sort of agent that won’t give up the $100. Of course, in the particular case at hand you will lose but rational agents often lose in specific cases (as, for example, when such an agent loses a rational bet). It could be argued that what a rational agent should not do is be the type of agent that loses. Given that agents that refuse to give up the $100 are the type of agent that loses, there seem to be grounds to claim that counterfactual mugging is another case where both CDT and EDT act inappropriately.</p>
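The policy-level point can be put numerically: evaluated before the coin flip, the paying disposition has the higher expected value. The sketch assumes a fair coin and a perfectly trustworthy Omega.

```python
def expected_value(disposed_to_pay):
    # Heads: Omega pays $10,000, but only to agents disposed to pay on tails.
    heads = 10_000 if disposed_to_pay else 0
    # Tails: agents disposed to pay hand over $100.
    tails = -100 if disposed_to_pay else 0
    return 0.5 * heads + 0.5 * tails

print(expected_value(True))   # 4950.0
print(expected_value(False))  # 0.0
```

Agents disposed to pay average $4,950 per encounter; refusers average $0, even though in the tails case actually at hand the payer is $100 worse off.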
<h4 id="prisoners-dilemma"><a href="#prisoners-dilemma">11.1.10. Prisoner’s dilemma</a></h4>
<p>Before moving on to a more detailed discussion of various possible decision theories, I’ll consider one final scenario: the <em>prisoner’s dilemma</em>. <a href="http://www.amazon.com/Choices-An-Introduction-Decision-Theory/dp/0816614407/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Resnik (1987, pp. 147-148)</a> outlines this scenario as follows:</p>
<blockquote>
<p>Two prisoners...have been arrested for vandalism and have been isolated from each other. There is sufficient evidence to convict them on the charge for which they have been arrested, but the prosecutor is after bigger game. He thinks that they robbed a bank together and that he can get them to confess to it. He summons each separately to an interrogation room and speaks to each as follows: “I am going to offer the same deal to your partner, and I will give you each an hour to think it over before I call you back. This is it: If one of you confesses to the bank robbery and the other does not, I will see to it that the confessor gets a one-year term and that the other guy gets a twenty-five year term. If you both confess, then it’s ten years apiece. If neither of you confesses, then I can only get two years apiece on the vandalism charge...”</p>
</blockquote>
<p>The decision matrix of each vandal will be as follows:</p>
<table border="0" cellspacing="5" cellpadding="3">
<tbody>
<tr>
<td class="numeric"> </td>
<td><em>Partner confesses</em></td>
<td><em>Partner lies</em></td>
</tr>
<tr>
<td><em>Confess</em></td>
<td>10 years in jail</td>
<td>1 year in jail</td>
</tr>
<tr>
<td><em>Lie</em></td>
<td>25 years in jail</td>
<td>2 years in jail</td>
</tr>
</tbody>
</table>
<p>Faced with this scenario, a CDT agent will confess. After all, the agent’s decision can’t influence their partner’s decision (they’ve been isolated from one another) and so the agent is better off confessing regardless of what their partner chooses to do. According to the majority of decision (and game) theorists, confessing is in fact the rational decision in this case.</p>
<p>Despite this, however, an EDT agent may lie in a prisoner’s dilemma. Specifically, if they think that their partner is similar enough to them, the agent will lie because doing so will provide the good news that they will both lie and hence that they will both get two years in jail (good news as compared with the bad news that they will both confess and hence that they will get 10 years in jail).</p>
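The EDT agent’s reasoning can be sketched with a credence parameter for how likely the partner is to mirror one’s choice. The 0.75 crossover below falls out of these particular sentences and is not a general constant.

```python
# Jail terms from the decision matrix in the text (lower is better).
years = {("confess", "confess"): 10, ("confess", "lie"): 1,
         ("lie", "confess"): 25, ("lie", "lie"): 2}

def edt_years(act, p_mirror):
    # p_mirror: credence that the partner makes the same choice you do.
    other = "lie" if act == "confess" else "confess"
    return p_mirror * years[(act, act)] + (1 - p_mirror) * years[(act, other)]

# With high confidence the partner is similar, lying minimizes expected time:
assert edt_years("lie", 0.9) < edt_years("confess", 0.9)
# With no correlation, confessing wins, as dominance reasoning says:
assert edt_years("confess", 0.5) < edt_years("lie", 0.5)
```

For these payoffs, lying beats confessing exactly when `p_mirror` exceeds 0.75, which makes precise the informal condition that the partner be “similar enough.”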
<p>To many people, there seems to be something compelling about this line of reasoning. For example, <a href="http://www.amazon.com/Metamagical-Themas-Questing-Essence-Pattern/dp/0465045669/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Douglas Hofstadter (1985, pp. 737-780)</a> has argued that an agent acting “superrationally” would co-operate with other superrational agents for precisely this sort of reason: a superrational agent would take into account the fact that other such agents will go through the same thought process in the <em>prisoner’s dilemma</em> and so make the same decision. As such, it is better that the decision both agents reach be to lie than that it be to confess. More broadly, it could perhaps be argued that a rational agent should lie in the <em>prisoner’s dilemma</em> as long as they believe that they are similar enough to their partner that they are likely to reach the same decision.</p>
<div class="figure"><div class="imgonly"><img src="http://i.imgur.com/fPUcm.jpg" alt="An argument for cooperation in the prisoners’ dilemma" loading="lazy"></div>
<p class="caption">An argument for cooperation in the prisoners’ dilemma</p>
</div>
<p>It is unclear, then, precisely what should be concluded from the prisoner’s dilemma. However, for those that are sympathetic to Hofstadter’s point or the line of reasoning appealed to by the EDT agent, the scenario seems to provide an additional reason to seek out an alternative theory to CDT.</p>
<h3 id="benchmark-theory-bt"><a href="#benchmark-theory-bt">11.2. Benchmark theory (BT)</a></h3>
<p>One recent response to the apparent failure of EDT to decide appropriately in medical Newcomb problems and CDT to decide appropriately in the psychopath button is Benchmark Theory (BT) which was developed in <a href="http://www.springerlink.com/content/a66107137n821610/?MUD=MP">Wedgwood (2011)</a> and discussed further in <a href="http://philreview.dukejournals.org/content/119/1/1.abstract">Briggs (2010)</a>.</p>
<p>In English, we could think of this decision algorithm as saying that agents should decide so as to give their future self good news about how well off they are compared to how well off they could have been. In formal terms, BT uses the following formula to calculate the expected utility of an act, A:</p>
<p><div class="imgonly"><img src="http://i.imgur.com/fUjmj.gif" alt="BT expected value formula" loading="lazy"></div>.</p>
<p>In other words, it uses the conditional probability, as in EDT, but calculates the value differently (as indicated by the use of V’ rather than V). V’ is calculated relative to a benchmark value in order to give a comparative measure of value (both of the above sources go into more detail about this process).</p>
<p>Taking the informal perspective, in the <em>chewing gum problem</em>, BT will note that by chewing gum, the agent will always get the good news that they are comparatively better off than they could have been (because chewing gum helps control throat abscesses) whereas by not chewing, the agent will always get the bad news that they could have been comparatively better off by chewing. As such, a BT agent will chew in this scenario.</p>
<p>Further, BT seems to reach what many consider to be the right decision in the <em>psychopath button</em>. In this case, the BT agent will note that if they push the button they will get the bad news that they are almost certainly a psychopath and so that they would have been comparatively much better off by not pushing (as pushing will kill them). On the other hand, if they don’t push they will get the less bad news that they are almost certainly not a psychopath and so could have been comparatively a little better off if they had pushed the button (as this would have killed all the psychopaths but not them). So refraining from pushing the button gives the less bad news and so is the rational decision.</p>
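<p>To make the informal reasoning above concrete, here is a minimal numerical sketch of BT applied to the <em>chewing gum problem</em>. The payoffs and conditional probabilities below are invented for illustration, and the benchmark is assumed to be the best value achievable in each state, which is one common way of making V’ precise; see Wedgwood (2011) and Briggs (2010) for the actual formal details.</p>

```python
# Illustrative sketch of Benchmark Theory (BT). All numbers are invented,
# and the benchmark b(s) is assumed to be the best value achievable in
# state s (one common way of making V' precise).

def bt_eu(p, value, act, states, acts):
    """BT expected utility: sum over states of P(s | act) * (V(act, s) - b(s))."""
    benchmark = {s: max(value[(a, s)] for a in acts) for s in states}
    return sum(p[(act, s)] * (value[(act, s)] - benchmark[s]) for s in states)

def edt_eu(p, value, act, states):
    """Ordinary evidential expected utility, for comparison."""
    return sum(p[(act, s)] * value[(act, s)] for s in states)

# Chewing gum problem: a gene causes throat abscesses; chewing mitigates them.
states, acts = ["gene", "no_gene"], ["chew", "abstain"]
value = {("chew", "gene"): -5, ("abstain", "gene"): -10,
         ("chew", "no_gene"): 0, ("abstain", "no_gene"): 0}
p = {("chew", "gene"): 0.9, ("chew", "no_gene"): 0.1,        # chewing is evidence
     ("abstain", "gene"): 0.1, ("abstain", "no_gene"): 0.9}  # of having the gene

bt = {a: bt_eu(p, value, a, states, acts) for a in acts}
edt = {a: edt_eu(p, value, a, states) for a in acts}

assert bt["chew"] > bt["abstain"]    # BT chews, as described above...
assert edt["chew"] < edt["abstain"]  # ...while EDT refuses to chew
```

<p>On these numbers BT chews (the comparative news is never worse for chewing), while EDT abstains, reproducing the contrast between the two theories described above.</p>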
<p>On the face of it, then, there seem to be strong reasons to find BT compelling: it decides appropriately in these scenarios while, according to some people, EDT and CDT only decide appropriately in one or the other of them.</p>
<p>Unfortunately, a BT agent will fail to decide appropriately in other scenarios. First, those that hold that one-boxing is the appropriate decision in Newcomb’s problem will immediately find a flaw in BT. After all, in this scenario two-boxing gives the good news that the agent did comparatively better than they could have done (because they gain the $1000 from Box A which is more than they would have received otherwise) while one-boxing brings the bad news that they did comparatively worse than they could have done (as they did not receive this money). As such, a BT agent will two-box in Newcomb’s problem.</p>
<p>Further, <a href="http://philreview.dukejournals.org/content/119/1/1.abstract">Briggs (2010)</a> argues, though <a href="http://www.springerlink.com/content/a66107137n821610/?MUD=MP">Wedgwood (2011)</a> denies, that BT suffers from other problems. As such, even for those who support two-boxing in Newcomb’s problem, it could be argued that BT doesn’t represent an adequate theory of choice. It is unclear, then, whether BT is a desirable replacement for the alternative theories.</p>
<h3 id="timeless-decision-theory-tdt"><a href="#timeless-decision-theory-tdt">11.3. Timeless decision theory (TDT)</a></h3>
<p><a href="http://intelligence.org/files/TDT.pdf">Yudkowsky (2010)</a> offers another decision algorithm, <em>timeless decision theory</em> or TDT (see also <a href="http://intelligence.org/files/Comparison.pdf">Altair, 2013</a>). Specifically, TDT is intended as an explicit response to the idea that a theory of rational choice should lead an agent to “win”. As such, it will appeal to those who think it is appropriate to one-box in Newcomb’s problem and chew in the chewing gum problem.</p>
<p>In English, this algorithm can be approximated as saying that an agent ought to choose as if CDT were right but they were determining not their actual decision but rather the result of the abstract computation of which their decision is one concrete instance. Formalizing this decision algorithm would require a substantial document in its own right and so will not be carried out in full here. Briefly, however, TDT is built on top of causal Bayesian networks <a href="http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/0521773628/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">(Pearl, 2000)</a> which are graphs where the arrows represent causal influence. TDT supplements these graphs by adding nodes representing abstract computations and taking the abstract computation that determines an agent’s decision to be the object of choice rather than the concrete decision itself (see <a href="http://intelligence.org/files/TDT.pdf">Yudkowsky, 2010</a> for a more detailed description).</p>
<p>Returning to an informal discussion, an example will help clarify the form taken by TDT: imagine that two perfect replicas of a person are placed in identical rooms and asked to make the same decision. While each replica will make their own decision, in doing so, they will be carrying out the same computational process. As such, TDT will say that the replicas ought to act as if they are determining the result of this process and hence as if they are deciding the behavior of both copies.</p>
<p>Something similar can be said about Newcomb’s problem. In this case it is almost like there is again a replica of the agent: Omega’s model of the agent that it used to predict the agent’s behavior. Both the original agent and this “replica” respond to the same abstract computational process as one another. In other words, both Omega’s prediction and the agent’s behavior are influenced by this process. As such, TDT advises the agent to act as if they are determining the result of this process and, hence, as if they can determine Omega’s box-filling behavior. A TDT agent will therefore one-box in order to determine the result of this abstract computation in a way that leads to $1 million being placed in Box B.</p>
<p>TDT also succeeds in other areas. For example, in the chewing gum problem there is no “replica” agent so TDT will decide in line with standard CDT and choose to chew gum. Further, in the prisoner’s dilemma, a TDT agent will lie if its partner is another TDT agent (or a relevantly similar agent). After all, in this case both agents will carry out the same computational process and so TDT will advise that the agent act as if they are determining this process and hence simultaneously determining both their own and their partner’s decision. If so then it is better for the agent that both of them lie than that both of them confess.</p>
<p>However, despite its success, TDT also “loses” in some decision scenarios. For example, in counterfactual mugging, a TDT agent will not choose to give up the $100. This might seem surprising. After all, as with Newcomb’s problem, this case involves Omega predicting the agent’s behavior and hence involves a “replica”. However, this case differs in that the agent knows that the coin came up heads and so knows that they have nothing to gain by giving up the money.</p>
<p>For those who feel that a theory of rational choice should lead an agent to “win”, then, TDT seems like a step in the right direction but further work is required if it is to “win” in the full range of decision scenarios.</p>
<h3 id="decision-theory-and-winning"><a href="#decision-theory-and-winning">11.4. Decision theory and “winning”</a></h3>
<p>In the previous section, I discussed TDT, a decision algorithm that could be advanced as a replacement for CDT and EDT. One of the primary motivations for developing TDT is a sense that both CDT and EDT fail to reason in a desirable manner in some decision scenarios. However, despite acknowledging that CDT agents end up worse off in Newcomb’s Problem, many (and perhaps the majority of) decision theorists are proponents of CDT. On the face of it, this may seem to suggest that these decision theorists aren’t interested in developing a decision algorithm that “wins” but rather have some other aim in mind. If so then this might lead us to question the value of developing one-boxing decision algorithms.</p>
<p>However, the claim that most decision theorists don’t care about finding an algorithm that “wins” mischaracterizes their position. After all, proponents of CDT tend to take the challenge posed by the fact that CDT agents “lose” in Newcomb’s problem seriously (in the philosophical literature, it’s often referred to as the <em>Why ain’cha rich?</em> problem). A common reaction to this challenge is neatly summarized in <a href="http://www.amazon.com/Foundations-Decision-Cambridge-Probability-Induction/dp/0521641640/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0321928423&linkCode=as2&tag=lesswrong-20">Joyce (1999, pp. 153-154)</a> as a response to a hypothetical question about why, if two-boxing is rational, the CDT agent does not end up as rich as an agent that one-boxes:</p>
<blockquote>
<p>Rachel has a perfectly good answer to the “Why ain’t you rich?” question. “I am not rich,” she will say, “because I am not the kind of person [Omega] thinks will refuse the money. I’m just not like you, Irene [the one-boxer]. Given that I know that I am the type who takes the money, and given that [Omega] knows that I am this type, it was reasonable of me to think that the $1,000,000 was not in [the box]. The $1,000 was the most I was going to get no matter what I did. So the only reasonable thing for me to do was to take it.”</p>
</blockquote>
<blockquote>
<p>Irene may want to press the point here by asking, “But don’t you wish you were like me, Rachel?”… Rachel can and should admit that she <em>does</em> wish she were more like Irene… At this point, Irene will exclaim, “You’ve admitted it! It wasn’t so smart to take the money after all.” Unfortunately for Irene, her conclusion does not follow from Rachel’s premise. Rachel will patiently explain that wishing to be a [one-boxer] in a Newcomb problem is not inconsistent with thinking that one should take the $1,000 <em>whatever type one is</em>. When Rachel wishes she was Irene’s type she is wishing for <em>Irene’s options</em>, not sanctioning her choice… While a person who knows she will face (has faced) a Newcomb problem might wish that she were (had been) the type that [Omega] labels a [one-boxer], this wish does not provide a reason for <em>being</em> a [one-boxer]. It might provide a reason to try (before [the boxes are filled]) to change her type <em>if she thinks this might affect [Omega’s] prediction</em>, but it gives her no reason for doing anything other than taking the money once she comes to believe that she will be unable to influence what [Omega] does.</p>
</blockquote>
<p>In other words, this response distinguishes between the <em>winning decision</em> and the <em>winning type of agent</em> and claims that two-boxing is the winning decision in Newcomb’s problem (even if one-boxers are the winning type of agent). Consequently, insofar as decision theory is about determining which <em>decision</em> is rational, on this account CDT reasons correctly in Newcomb’s problem.</p>
<p>For those that find this response perplexing, an analogy could be drawn to the <em>chewing gum problem</em>. In this scenario, there is near unanimous agreement that the rational decision is to chew gum. However, statistically, non-chewers will be better off than chewers. As such, the non-chewer could ask, “if you’re so smart, why aren’t you healthy?” In this case, the above response seems particularly appropriate. The chewers are less healthy not because of their decision but rather because they’re more likely to have an undesirable gene. Having good genes doesn’t make the non-chewer more rational but simply more lucky. The proponent of CDT simply makes a similar response to Newcomb’s problem: one-boxers aren’t richer because of their decision but rather because of the type of agent that they were when the boxes were filled.</p>
<p>One final point about this response is worth noting. A proponent of CDT can accept the above argument but still acknowledge that, if given the choice before the boxes are filled, they would be rational to choose to modify themselves to be a one-boxing type of agent (as Joyce acknowledged in the above passage and as argued for in <a href="http://www.jstor.org/stable/20118389">Burgess, 2004</a>). To the proponent of CDT, this is unproblematic: if we are sometimes rewarded not for the rationality of our decisions in the moment but for the type of agent we were at some past moment then it should be unsurprising that changing to a different type of agent might be beneficial.</p>
<p>The response to this defense of two-boxing in Newcomb’s problem has been divided. Many find it compelling but others, like <a href="http://link.springer.com/article/10.1007%2Fs10670-011-9355-2">Ahmed and Price (2012)</a>, think it does not adequately address the challenge:</p>
<blockquote>
<p>It is no use the causalist’s whining that foreseeably, Newcomb problems do in fact reward irrationality, or rather CDT-irrationality. The point of the argument is that if everyone knows that the CDT-irrational strategy will in fact do better on average than the CDT-rational strategy, then it’s rational to play the CDT-irrational strategy.</p>
</blockquote>
<p>Given this, there seem to be two positions one could take on these issues. If the response given by the proponent of CDT is compelling, then we should be attempting to develop a decision theory that two-boxes in Newcomb’s problem. Perhaps the best theory for this role is CDT but perhaps it is instead BT, which many people think reasons better in the psychopath button scenario. On the other hand, if the response given by the proponents of CDT is not compelling, then we should be developing a theory that one-boxes in Newcomb’s problem. In this case, TDT, or something like it, seems like the most promising theory currently on offer.</p>
lukeprog, Thu, 28 Feb 2013 14:15:55 +0000
Beyond Bayesians and Frequentists by jsteinhardt
https://www.greaterwrong.com/posts/o32tEFf5zBiByL2xv/beyond-bayesians-and-frequentists
<p>(Note: this is cross-posted from my <a href="http://jsteinhardt.wordpress.com/2012/10/31/beyond-bayesians-and-frequentists/">blog</a> and also available in pdf <a href="http://web.mit.edu/jsteinha/www/stats-essay.pdf">here</a>.)</p>
<p>If you are a newly initiated student into the field of machine learning, it won’t be long before you start hearing the words “Bayesian” and “frequentist” thrown around. Many people around you probably have strong opinions on which is the “right” way to do statistics, and within a year you’ve probably developed your own strong opinions (which are suspiciously similar to those of the people around you, despite there being a much greater variance of opinion between different labs). In fact, now that the year is 2012 the majority of new graduate students are being raised as Bayesians (at least in the U.S.) with frequentists thought of as stodgy emeritus professors stuck in their ways.</p>
<p>If you are like me, the preceding set of facts will make you very uneasy. They will make you uneasy because simple pattern-matching—the strength of people’s opinions, the reliability with which these opinions split along age boundaries and lab boundaries, and the ridicule that each side levels at the other camp—makes the “Bayesians vs. frequentists” debate look far more like politics than like scholarly discourse. Of course, that alone does not necessarily prove anything; these disconcerting similarities could just be coincidences that I happened to cherry-pick.</p>
<p>My next point, then, is that we are right to be uneasy, because such debate makes us less likely to evaluate the strengths and weaknesses of both approaches in good faith. This essay is a push against that—I summarize the justifications for Bayesian methods and where they fall short, show how frequentist approaches can fill in some of their shortcomings, and then present my personal (though probably woefully under-informed) guidelines for choosing which type of approach to use.</p>
<p>Before doing any of this, though, a bit of background is in order...</p>
<p><strong>1. Background on Bayesians and Frequentists</strong></p>
<p><strong>1.1. Three Levels of Argument</strong></p>
<p>As Andrew Critch [6] insightfully points out, the Bayesians vs. frequentists debate is really three debates at once, centering around one or more of the following arguments:</p>
<ol><li><p>Whether to interpret subjective beliefs as probabilities</p></li><li><p>Whether to interpret probabilities as subjective beliefs (as opposed to asymptotic frequencies)</p></li><li><p>Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem.</p></li></ol>
<p>Given my own research interests, I will add a fourth argument:</p>
<p>4. Whether Bayesian or frequentist techniques are better suited to engineering an artificial intelligence.</p>
<p>Andrew Gelman [9] has his own well-written essay on the subject, where he expands on these distinctions and presents his own more nuanced view.</p>
<p>Why are these arguments so commonly conflated? I’m not entirely sure; I would guess it is for historical reasons but I have so far been unable to find said historical reasons. Whatever the reasons, what this boils down to in the present day is that people often form opinions on 1. and 2., which then influence their answers to 3. and 4. This is <em>not good</em>, since 1. and 2. are philosophical in nature and difficult to resolve correctly, whereas 3. and 4. are often much easier to resolve and extremely important to resolve correctly in practice. Let me re-iterate: <em>the Bayes vs. frequentist discussion should center on the practical employment of the two methods, or, if epistemology must be discussed, it should be clearly separated from the day-to-day practical decisions</em>. Aside from the difficulties with correctly deciding epistemology, the relationship between generic epistemology and specific practices in cutting-edge statistical research is only via a long causal chain, and it should be completely unsurprising if Bayesian epistemology leads to the employment of frequentist tools or vice versa.<a id="more"></a></p>
<p>For this reason and for reasons of space, I will spend the remainder of the essay focusing on <em>statistical algorithms</em> rather than on <em>interpretations of probability</em>. For those who really want to discuss interpretations of probability, I will address that in a later essay.</p>
<p><strong>1.2. Recap of Bayesian Decision Theory</strong></p>
<p>(What follows will be review for many.) In Bayesian decision theory, we assume that there is some underlying world state θ and a <em>likelihood function</em> p(X1,...,Xn | θ) over possible observations. (A <em>likelihood function</em> is just a conditional probability distribution where the parameter conditioned on can vary.) We also have a space A of possible actions and a utility function U(θ; a) that gives the utility of performing action a if the underlying world state is θ. We can incorporate notions like planning and value of information by defining U(θ; a) recursively in terms of an identical agent to ourselves who has seen one additional observation (or, if we are planning against an adversary, in terms of the adversary). For a more detailed overview of this material, see the tutorial by North [11].</p>
<p>What distinguishes the Bayesian approach in particular is one additional assumption, a <em>prior distribution</em> p(θ) over possible world states. To make a decision with respect to a given prior, we compute the posterior distribution p<sub>posterior</sub>(θ | X1,...,Xn) using Bayes’ theorem, then take the action a that maximizes <div class="imgonly"><img src="http://www.codecogs.com/png.latex?\mathbb{E}_{p_{\mathrm{posterior}}}[U(\theta;+a)]" alt="" loading="lazy"></div>.</p>
<p>In practice, p<sub>posterior</sub>(θ | X1,...,Xn) can be quite difficult to compute, and so we often attempt to approximate it. Such attempts are known as <em>approximate inference algorithms</em>.</p>
<p><strong>1.3. Steel-manning Frequentists</strong></p>
<p>There are many different ideas that fall under the broad umbrella of frequentist techniques. While it would be impossible to adequately summarize all of them even if I attempted to, there are three in particular that I would like to describe, and which I will call <em>frequentist decision theory</em>, <em>frequentist guarantees</em>, and <em>frequentist analysis tools</em>.</p>
<p>Frequentist decision theory has a very similar setup to Bayesian decision theory, with a few key differences. These are discussed in detail and contrasted with Bayesian decision theory in [10], although we summarize the differences here. There is still a likelihood function p(X1,...,Xn | θ) and a utility function U(θ; a). However, we do not assume the existence of a prior on θ, and instead choose the decision rule a(X1,...,Xn) that maximizes</p>
<p class="imgonly"> <div class="imgonly"><img src="http://www.codecogs.com/png.latex?\displaystyle+\min\limits_{\theta}+\mathbb{E}[U(a(X_1,\ldots,X_n);+\theta)+\mid+\theta].+\+\+\+\+\+(1)" alt="" loading="lazy"></div></p>
<p>In other words, we ask for a worst case guarantee rather than an average case guarantee. As an example of how these would differ, imagine a scenario where we have no data to observe, an unknown θ in {1,...,N}, and we choose an action a in {0,...,N}. Furthermore, U(0; θ) = 0 for all θ, U(a; θ) = −1 if a = θ, and U(a;θ) = 1 if a ≠ 0 and a ≠ θ. Then a frequentist will always choose a = 0 because any other action gets −1 utility in the worst case; a Bayesian, on the other hand, will happily choose any non-zero value of a since such an action gains (N-2)/N utility in expectation. (I am purposely ignoring more complex ideas like mixed strategies for the purpose of illustration.)</p>
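<p>The toy scenario above is easy to check directly. A minimal sketch, with N = 10 and, again, mixed strategies ignored:</p>

```python
import numpy as np

# The toy game above, computed directly. States theta in {1..N}, actions
# a in {0..N}; U(0; theta) = 0, U(theta; theta) = -1, otherwise +1.
N = 10
U = np.ones((N + 1, N))              # U[a, theta - 1]
U[0, :] = 0
for theta in range(1, N + 1):
    U[theta, theta - 1] = -1

worst_case = U.min(axis=1)           # frequentist: maximize the min over theta
freq_action = int(np.argmax(worst_case))

avg_case = U.mean(axis=1)            # Bayesian with a uniform prior over theta
bayes_action = int(np.argmax(avg_case))
```

<p>The frequentist rule picks a = 0 for a guaranteed utility of 0, while the Bayesian picks a non-zero action for an expected (N-2)/N, exactly the divergence described above.</p>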
<p>Note that the frequentist optimization problem is more complicated than in the Bayesian case, since the value of (1) depends on the joint behavior of a(X1,...,Xn), whereas with Bayes we can optimize a(X1,...,Xn) for each set of observations separately.</p>
<p>As a result of this more complex optimization problem, it is often not actually possible to maximize (1), so many frequentist techniques instead develop tools to lower-bound (1) for a given decision procedure, and then try to construct a decision procedure that is reasonably close to the optimum. Support vector machines [2], which try to pick separating hyperplanes that minimize generalization error, are one example of this where the algorithm is explicitly trying to maximize worst-case utility. Another example of a frequentist decision procedure is L1-regularized least squares for sparse recovery [3], where the procedure itself does not look like it is explicitly maximizing any utility function, but a separate analysis shows that it is close to the optimal procedure anyway.</p>
<p>The second sort of frequentist approach to statistics is what I call a <em>frequentist guarantee</em>. A frequentist guarantee on an algorithm is a guarantee that, with high probability with respect to how the data was generated, the output of the algorithm will satisfy a given property. The most familiar example of this is any algorithm that generates a frequentist confidence interval: to generate a 95% frequentist confidence interval for a parameter θ is to run an algorithm that outputs an interval, such that with probability at least 95% θ lies within the interval. An important fact about most such algorithms is that the size of the interval only grows logarithmically with the amount of confidence we require, so getting a 99.9999% confidence interval is only slightly harder than getting a 95% confidence interval (and we should probably be asking for the former whenever possible).</p>
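<p>The logarithmic dependence on the confidence level is easy to see in the standard Hoeffding-style interval for the mean of n observations bounded in [0, 1], whose half-width is √(log(2/δ)/2n). (This is one textbook construction, not the only such algorithm.)</p>

```python
import math

# Half-width of a Hoeffding-style confidence interval for the mean of n
# observations in [0, 1]: sqrt(log(2/delta) / (2n)). This illustrates the
# logarithmic cost of extra confidence mentioned above.
def half_width(n, delta):
    return math.sqrt(math.log(2 / delta) / (2 * n))

n = 10_000
w95 = half_width(n, 0.05)       # 95% interval
w6nines = half_width(n, 1e-6)   # 99.9999% interval

# Demanding vastly more confidence costs less than a 2x wider interval.
assert w6nines / w95 < 2
```

<p>At n = 10,000 the 99.9999% interval is less than twice as wide as the 95% one, which is why asking for the stronger guarantee is so cheap.</p>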
<p>If we use such algorithms to test hypotheses or to test discrete properties of θ, then we can obtain algorithms that take in probabilistically generated data and produce an output that with high probability depends <em>only on how the data was generated</em>, not on the specific random samples that were given. For instance, we can create an algorithm that takes in samples from two distributions, and is guaranteed to output 1 whenever they are the same, 0 whenever they differ by at least ε in total variational distance, and could have arbitrary output if they are different but the total variational distance is less than ε. This is an amazing property—it takes in random input and produces an essentially deterministic answer.</p>
<p>Finally, a third type of frequentist approach seeks to construct <em>analysis tools</em> for understanding the behavior of random variables. Metric entropy, the Chernoff and Azuma-Hoeffding bounds [12], and Doob’s optional stopping theorem are representative examples of this sort of approach. Arguably, everyone with the time to spare should master these techniques, since being able to analyze random variables is important no matter what approach to statistics you take. Indeed, frequentist analysis tools have no conflict at all with Bayesian methods—they simply provide techniques for understanding the behavior of the Bayesian model.</p>
<p><strong>2. Bayes vs. Other Methods</strong></p>
<p><strong>2.1. Justification for Bayes</strong></p>
<p>We presented Bayesian decision theory above, but are there any reasons why we should actually use it? One commonly-given reason is that Bayesian statistics is merely the application of Bayes’ Theorem, which, being a theorem, describes the only correct way to update beliefs in response to new evidence; anything else can only be justified to the extent that it provides a good approximation to Bayesian updating. This may be true, but Bayes’ Theorem only applies if we already have a prior, and if we accept probability as the correct framework for expressing uncertain beliefs. We might want to avoid one or both of these assumptions. Bayes’ theorem also doesn’t explain why we care about expected utility as opposed to some other statistic of the distribution over utilities (although note that frequentist decision theory also tries to maximize expected utility).</p>
<p>One compelling answer to this is <strong>dutch-booking</strong>, which shows that any agent must implicitly be using a probability model to make decisions, or else there is a series of bets that they would be willing to make that causes them to lose money with certainty. Another answer is the <strong>complete class theorem</strong>, which shows that any non-Bayesian decision procedure is <em>dominated</em> by a Bayesian decision procedure—meaning that the Bayesian procedure performs at least as well as the non-Bayesian procedure in every case. In other words, if you are doing anything non-Bayesian, then either it is secretly a Bayesian procedure or there is another procedure that does at least as well as it in all cases. Finally, the <strong>VNM Utility Theorem</strong> states that any agent with consistent preferences over distributions of outcomes must be implicitly maximizing the expected value of some scalar-valued function, which we can then use as our choice of utility function U. These theorems, however, ignore the issue of computation—while the best decision procedure may be Bayesian, the best computationally-efficient decision procedure could easily be non-Bayesian.</p>
<p>Another justification for Bayes is that, in contrast to ad hoc frequentist techniques, it actually provides a general theory for constructing statistical algorithms, as well as for incorporating side information such as expert knowledge. Indeed, when trying to model complex and highly structured situations it is difficult to obtain any sort of frequentist guarantees (although analysis tools can still often be applied to gain intuition about parts of the model). A prior lets us write down the sorts of models that would allow us to capture structured situations (for instance, when trying to do language modeling or transfer learning). Non-Bayesian methods exist for these situations, but they are often ad hoc and in many cases end up looking like an approximation to Bayes. One example of this is Kneser-Ney smoothing for n-gram models, an ad hoc algorithm that ended up being very similar to an approximate inference algorithm for the hierarchical Pitman-Yor process [15, 14, 17, 8]. This raises another important point <em>against</em> Bayes, which is that the proper Bayesian interpretation may be very mathematically complex. Pitman-Yor processes are on the cutting-edge of Bayesian nonparametric statistics, which is itself one of the more technical subfields of statistical machine learning, so it was probably much easier to come up with Kneser-Ney smoothing than to find the interpretation in terms of Pitman-Yor processes.</p>
<p><strong>2.2. When the Justifications Fail</strong></p>
<p>The first and most common objection to Bayes is that a Bayesian method is only as good as its prior. While for simple models the performance of Bayes is relatively independent of the prior, such models can only capture data where frequentist techniques would also perform very well. For more complex (especially nonparametric) Bayesian models, the performance can depend strongly on the prior, and designing good priors is still an open problem. As one example I point to my own research on hierarchical nonparametric models, where the most straightforward attempts to build a hierarchical model lead to severe pathologies [13].</p>
<p>Even if a Bayesian model does have a good prior, it may be computationally intractable to perform posterior inference. For instance, structure learning in Bayesian networks is NP-hard [4], as is topic inference in the popular latent Dirichlet allocation model (and this continues to hold even if we only want to perform approximate inference). Similar stories probably hold for other common models, although a theoretical survey has yet to be made; suffice it to say that in practice approximate inference remains a difficult and unsolved problem, with many models not even considered because of the apparent hopelessness of performing inference in them.</p>
<p>Because frequentist methods often come with an analysis of the specific algorithm being employed, they can sometimes overcome these computational issues. One example of this mentioned already is L1 regularized least squares [3]. The problem setup is that we have a linear regression task Ax = b+v where A and b are known, v is a noise vector, and x is believed to be sparse (typically x has many more rows than b, so without the sparsity assumption x would be underdetermined). Let us suppose that x has n rows and k non-zero rows—then the number of possible sparsity patterns is <div class="imgonly"><img src="http://www.codecogs.com/png.latex?\binom{n}{k}" alt="" loading="lazy"></div> --- large enough that a brute force consideration of all possible sparsity patterns is intractable. However, we can show that solving a certain semidefinite program will with high probability yield the appropriate sparsity pattern, after which recovering x reduces to a simple least squares problem. (A <em>semidefinite program</em> is a certain type of optimization problem that can be solved efficiently [16].)</p>
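<p>As a hedged illustration of the recovery problem just described, the sketch below solves the L1-regularized least-squares objective with simple iterative soft-thresholding (ISTA), a basic proximal method, rather than the semidefinite-programming formulation mentioned above; the dimensions, support, and regularization strength are invented for the example.</p>

```python
import numpy as np

# Sparse recovery demo: b = A x with x sparse, solved by iterative
# soft-thresholding (ISTA) on the L1-regularized least-squares objective.
# All problem dimensions below are illustrative.
rng = np.random.default_rng(0)
n, m, k = 100, 40, 3                      # x has n rows, b has m, k nonzeros
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
support = [5, 40, 77]
x_true[support] = [5.0, -4.0, 3.0]
b = A @ x_true

lam = 0.05                                # L1 penalty weight
step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of gradient
x = np.zeros(n)
for _ in range(3000):
    z = x - step * A.T @ (A @ x - b)                          # gradient step
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold

recovered = set(np.argsort(np.abs(x))[-k:])
assert recovered == set(support)          # correct sparsity pattern found
```

<p>On this instance the thresholded iterates find the correct sparsity pattern, after which, as described above, recovering x exactly reduces to an easy least-squares problem on the identified columns.</p>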
<p>Finally, Bayes has no good way of dealing with adversaries or with cases where the data was generated in a complicated way that could make it highly biased (for instance, as the output of an optimization procedure). A toy example of an adversary would be playing rock-paper-scissors—how should a Bayesian play such a game? The straightforward answer is to build up a model of the opponent based on their plays so far, and then to make the play that maximizes the expected score (probability of winning minus probability of losing). However, such a strategy fares poorly against any opponent with access to the model being used, as they can then just run the model themselves to predict the Bayesian’s plays in advance, thereby winning every single time. In contrast, there is a frequentist strategy called the <strong>multiplicative weights update method</strong> that fares well against an arbitrary opponent (even one with superior computational resources and access to our agent’s source code). The multiplicative weights method does far more than winning at rock-paper-scissors—it is also a key component of the fastest algorithm for solving many important optimization problems (including the network flow algorithm), and it forms the theoretical basis for the widely used AdaBoost algorithm [1, 5, 7].</p>
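<p>A minimal sketch of the multiplicative weights (Hedge) strategy playing rock-paper-scissors makes the guarantee concrete. The opponent below happens to always throw rock, but nothing in the update assumes this; the learning rate and horizon are illustrative.</p>

```python
import numpy as np

# Multiplicative weights (Hedge) at rock-paper-scissors: against ANY play
# sequence, the average regret to the best fixed action vanishes.
payoff = np.array([[0, -1, 1],       # rows: our play (R, P, S)
                   [1, 0, -1],       # cols: opponent's play (R, P, S)
                   [-1, 1, 0]])
T, eta = 2000, 0.05                  # horizon and learning rate (illustrative)
w = np.ones(3)                       # one weight per action
our_total = 0.0
gains = np.zeros(3)                  # cumulative gain of each fixed action

for t in range(T):
    p = w / w.sum()                  # our mixed strategy this round
    opp = 0                          # this adversary always throws rock
    our_total += p @ payoff[:, opp]  # expected payoff of our mixed play
    gains += payoff[:, opp]
    w *= np.exp(eta * payoff[:, opp])  # multiplicative update

avg_regret = (gains.max() - our_total) / T
assert avg_regret < 0.1              # vanishing average regret
```

<p>The same update gives a low-regret guarantee against an opponent who can simulate our agent, which is exactly what a Bayesian opponent-model cannot offer in this setting.</p>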
<p><strong>2.3. When To Use Each Method</strong></p>
<p>The essential difference between Bayesian and frequentist decision theory is that Bayes makes the additional assumption of a prior over θ, and optimizes for average-case performance rather than worst-case performance. <em>It follows, then, that Bayes is the superior method whenever we can obtain a good prior and when good average-case performance is sufficient.</em> However, if we have no way of obtaining a good prior, or when we need guaranteed performance, frequentist methods are the way to go. For instance, if we are trying to build a software package that should be widely deployable, we might want to use a frequentist method because users can be sure that the software will work as long as some number of easily-checkable assumptions are met.</p>
<p>A nice middle-ground between purely Bayesian and purely frequentist methods is to use a Bayesian model coupled with frequentist model-checking techniques; this gives us the freedom in modeling afforded by a prior but also gives us some degree of confidence that our model is correct. This approach is suggested by both Gelman [9] and Jordan [10].</p>
<p><strong>3. Conclusion</strong></p>
<p>When the assumptions of Bayes’ Theorem hold, and when Bayesian updating can be performed computationally efficiently, then it is indeed tautological that Bayes is the optimal approach. Even when some of these assumptions fail, Bayes can still be a fruitful approach. However, by working under weaker (sometimes even adversarial) assumptions, frequentist approaches can perform well in very complicated domains even with fairly simple models; this is because, with fewer assumptions being made at the outset, less work has to be done to ensure that those assumptions are met.</p>
<p>From a research perspective, we should be far from satisfied with either approach—Bayesian methods make stronger assumptions than may be warranted, while frequentist methods provide little in the way of a coherent framework for constructing models, and ask for worst-case guarantees, which probably cannot be obtained in general. We should seek to develop a statistical modeling framework that, unlike Bayes, can deal with unknown priors, adversaries, and limited computational resources.</p>
<p><strong>4. Acknowledgements</strong></p>
<p>Thanks to Emma Pierson, Vladimir Slepnev, and Wei Dai for reading preliminary versions of this work and providing many helpful comments.</p>
<p><strong>5. References</strong><strong> </strong></p>
<p>[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. <em>Working Paper</em>, 2005.</p>
<p>[2] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. <em>Data Mining and Knowledge Discovery</em>, 2:121--167, 1998.</p>
<p>[3] Emmanuel J. Candes. Compressive sampling. In <em>Proceedings of the International Congress of Mathematicians</em>. European Mathematical Society, 2006.</p>
<p>[4] D.M. Chickering. Learning Bayesian networks is NP-complete. <em>Lecture Notes in Statistics</em>, pages 121--130. Springer, New York, 1996.</p>
<p>[5] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel Spielman, and Shang-Hua Teng. Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs. In <em>Proceedings of the 43rd ACM Symposium on Theory of Computing</em>, 2011.</p>
<p>[6] Andrew Critch. Frequentist vs. bayesian breakdown: Interpretation vs. inference. <a href="https://www.greaterwrong.com/posts/mQfNymou9q5riEKrf/frequentist-vs-bayesian-breakdown-interpretation-vs" class="bare-url">http://lesswrong.com/lw/7ck/frequentist_vs_bayesian_breakdown_interpretation/</a>.</p>
<p>[7] Yoav Freund and Robert E. Schapire. A short introduction to boosting. <em>Journal of Japanese Society for Artificial Intelligence</em>, 14(5):771--780, Sep. 1999.</p>
<p>[8] J. Gasthaus and Y.W. Teh. Improvements to the sequence memoizer. In <em>Advances in Neural Information Processing Systems</em>, 2011.</p>
<p>[9] Andrew Gelman. Induction and deduction in bayesian data analysis. <em>RMM</em>, 2:67--78, 2011.</p>
<p>[10] Michael I. Jordan. Are you a bayesian or a frequentist? Machine Learning Summer School 2009 (video lecture at <a href="http://videolectures.net/mlss09uk_jordan_bfway/" class="bare-url">http://videolectures.net/mlss09uk_jordan_bfway/</a>).</p>
<p>[11] D. Warner North. A tutorial introduction to decision theory. <em>IEEE Transactions on Systems Science and Cybernetics</em>, SSC-4(3):200--210, Sep. 1968.</p>
<p>[12] Igal Sason. On refined versions of the Azuma-Hoeffding inequality with applications in information theory. <em>CoRR</em>, abs/1111.1977, 2011.</p>
<p>[13] Jacob Steinhardt and Zoubin Ghahramani. Pathological properties of deep bayesian hierarchies. In <em>NIPS Workshop on Bayesian Nonparametrics</em>, 2011. Extended Abstract.</p>
<p>[14] Y.W. Teh. A bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS, 2006.</p>
<p>[15] Y.W. Teh. A hierarchical bayesian language model based on pitman-yor processes. <em>Coling/ACL</em>, 2006.</p>
<p>[16] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. <em>SIAM Review</em>, 38(1):49--95, Mar. 1996.</p>
<p>[17] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. A stochastic memoizer for sequence data. In <em>Proceedings of the 26th International Conference on Machine Learning</em>, pages 1129--1136, 2009.</p>
jsteinhardt · Wed, 31 Oct 2012 07:03:00 +0000
Bayes’ Theorem Illustrated (My Way) by komponisto
https://www.greaterwrong.com/posts/CMt3ijXYuCynhPWXa/bayes-theorem-illustrated-my-way
<p><em>(This post is elementary: it introduces a simple method of visualizing Bayesian calculations. In my defense, we’ve had <a href="https://www.greaterwrong.com/posts/AN2cBr6xKWCB8dRQG/what-is-bayesianism">other</a> elementary posts before, and they’ve been found useful; plus, I’d really like this to be online somewhere, and it might as well be here.)</em></p>
<p>I’ll admit, those <a href="http://en.wikipedia.org/wiki/Monty_Hall_problem">Monty-Hall</a>-<a href="https://www.greaterwrong.com/posts/Hug2ePykMkmPzSsx6/drawing-two-aces">type</a> problems invariably trip me up. Or at least, they do if I’m not thinking <em>very</em> carefully—doing quite a bit more work than other people seem to have to do.</p>
<p>What’s more, people’s explanations of how to get the right answer have almost never been satisfactory to me. If I concentrate hard enough, I can usually follow the reasoning, sort of; but I never quite “see it”, and nor do I feel equipped to solve similar problems in the future: it’s as if the solutions seem to work only in retrospect. </p>
<p><a href="https://www.greaterwrong.com/posts/baTWMegR42PAsH9qJ/generalizing-from-one-example">Minds work differently</a>, <a href="https://www.greaterwrong.com/posts/sSqoEw9eRP2kPKLCz/illusion-of-transparency-why-no-one-understands-you">illusion of transparency</a>, and all that.</p>
<p>Fortunately, I eventually managed to identify the source of the problem, and I came up with a way of thinking about—<em>visualizing—</em>such problems that suits my own intuition. Maybe there are others out there like me; this post is for them.</p>
<p>I’ve <a href="http://wiki.lesswrong.com/wiki/Chat_Logs/2010-02-18">mentioned before</a> that I like to think in very abstract terms. What this means in practice is that, if there’s some simple, general, elegant point to be made, <em>tell it to me right away</em>. Don’t start with some messy concrete example and attempt to “work upward”, in the hope that difficult-to-grasp abstract concepts will be made more palatable by relating them to “real life”. If you do that, I’m liable to get stuck in the trees and not see the forest. Chances are, I won’t have much trouble understanding the abstract concepts; “real life”, on the other hand...</p>
<p>...well, let’s just say I prefer to start at the top and work downward, as a general rule. Tell me how the trees relate to the forest, rather than the other way around.</p>
<p>Many people have found Eliezer’s <a href="http://yudkowsky.net/rational/bayes">Intuitive Explanation of Bayesian Reasoning</a> to be an excellent introduction to <a href="http://wiki.lesswrong.com/wiki/Bayes%27_theorem">Bayes’ theorem</a>, and so I don’t usually hesitate to recommend it to others. But for me personally, if I didn’t know Bayes’ theorem and you were trying to explain it to me, pretty much the worst thing you could do would be to start with some detailed scenario involving breast-cancer screenings. (And not just because it tarnishes beautiful mathematics with images of sickness and death, either!)</p>
<p>So what’s the right way to explain Bayes’ theorem to me?</p>
<p>Like this:</p>
<p>We’ve got a bunch of hypotheses (states the world could be in) and we’re trying to figure out which of them is true (that is, which state the world is actually in). As a concession to concreteness (and for ease of drawing the pictures), let’s say we’ve got three (mutually exclusive and exhaustive) hypotheses—possible world-states—which we’ll call H<sub>1</sub>, H<sub>2</sub>, and H<sub>3</sub>. We’ll represent these as blobs in space:</p>
<p class="imgonly" style="--aspect-ratio: 0.5421687; max-width: 225px"><img src="http://imgur.com/NpNUV.png" alt="Figure 0" loading="lazy"></p>
<p><strong> Figure 0</strong></p>
<p><br>Now, we have some prior notion of how probable each of these hypotheses is—that is, each has some <em>prior probability</em>. If we don’t know anything at all that would make one of them more probable than another, they would each have probability <span class="frac"><sup>1</sup>⁄<sub>3</sub></span>. To illustrate a more typical situation, however, let’s assume we have more information than that. Specifically, let’s suppose our prior probability distribution is as follows: P(H<sub>1</sub>) = 30%, P(H<sub>2</sub>)=50%, P(H<sub>3</sub>) = 20%. We’ll represent this by resizing our blobs accordingly:</p>
<p class="imgonly" style="--aspect-ratio: 0.63227016; max-width: 337px"><img src="http://i.imgur.com/8JAkA.png" alt="Figure 1" loading="lazy"></p>
<p> <strong>Figure 1<br></strong></p>
<p>That’s our <em>prior</em> knowledge. Next, we’re going to collect some <em>evidence</em> and <em>update</em> our prior probability distribution to produce a <em>posterior</em> probability distribution. Specifically, we’re going to run a test. The test we’re going to run has three possible outcomes: Result A, Result B, and Result C. Now, since this test happens to have three possible results, it would be really nice if the test just flat-out told us which world we were living in—that is, if (say) Result A meant that H<sub>1</sub> was true, Result B meant that H<sub>2</sub> was true, and Result C meant that H<sub>3</sub> was true. Unfortunately, the real world is messy and complex, and things aren’t that simple. Instead, we’ll suppose that each result can occur under each hypothesis, but that the different hypotheses have different effects on how likely each result is to occur. We’ll assume, for instance, that if Hypothesis H<sub>1</sub> is true, we have a <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> chance of obtaining Result A, a <span class="frac"><sup>1</sup>⁄<sub>3</sub></span> chance of obtaining Result B, and a <span class="frac"><sup>1</sup>⁄<sub>6</sub></span> chance of obtaining Result C; which we’ll write like this:</p>
<p>P(A|H<sub>1</sub>) = 50%, P(B|H<sub>1</sub>) = 33.33...%, P(C|H<sub>1</sub>) = 16.66...%</p>
<p>and illustrate like this:</p>
<p class="imgonly" style="--aspect-ratio: 2.7428572; max-width: 384px"><img src="http://imgur.com/9jpzJ.png" alt="" loading="lazy"></p>
<p> <strong>Figure 2</strong></p>
<p>(Result A being represented by a triangle, Result B by a square, and Result C by a pentagon.)</p>
<p>If Hypothesis H<sub>2</sub> is true, we’ll assume there’s a 10% chance of Result A, a 70% chance of Result B, and a 20% chance of Result C:</p>
<p class="imgonly" style="--aspect-ratio: 2.4055555; max-width: 433px"><img src="http://imgur.com/puWW1.png" alt="Figure 3" loading="lazy"></p>
<p> <strong>Figure 3</strong></p>
<p><strong><br></strong>(P(A|H<sub>2</sub>) = 10% , P(B|H<sub>2</sub>) = 70%, P(C|H<sub>2</sub>) = 20%)<strong><br></strong></p>
<p>Finally, we’ll say that if Hypothesis H<sub>3</sub> is true, there’s a 5% chance of Result A, a 15% chance of Result B, and an 80% chance of Result C:<strong><br></strong></p>
<p class="imgonly" style="--aspect-ratio: 2.7428572; max-width: 384px"><img src="http://imgur.com/DHitn.png" alt="Figure 4" loading="lazy"></p>
<p> <strong>Figure 4</strong></p>
<p>(P(A|H<sub>3</sub>) = 5%, P(B|H<sub>3</sub>) = 15%, P(C|H<sub>3</sub>) = 80%)</p>
<p>Figure 5 below thus shows our knowledge prior to running the test:</p>
<p class="imgonly"><img src="http://imgur.com/qlyGw.png" alt="" loading="lazy"></p>
<p> <strong>Figure 5</strong></p>
<p>Note that we have now carved up our hypothesis-space more finely; our possible world-states are now things like “Hypothesis H<sub>1</sub> is true and Result A occurred”, “Hypothesis H<sub>1</sub> is true and Result B occurred”, etc., as opposed to merely “Hypothesis H<sub>1</sub> is true”, etc. The numbers above the slanted line segments—the <em>likelihoods</em> of the test results, assuming the particular hypothesis—represent <em>what proportion</em> of the total probability mass assigned to the hypothesis H<sub>n</sub> is assigned to the conjunction of Hypothesis H<sub>n</sub> and Result X; thus, since P(H<sub>1</sub>) = 30%, and P(A|H<sub>1</sub>) = 50%, P(H<sub>1</sub> & A) is therefore 50% of 30%, or, in other words, 15%.</p>
<p>(That’s really all Bayes’ theorem is, right there, but—shh! -- don’t tell anyone yet!)</p>
<p><br>Now, then, suppose we run the test, and we get...Result A.</p>
<p>What do we do? We <em>cut off all the other branches</em>:</p>
<p class="imgonly"><img src="http://imgur.com/XBXi5.png" alt="" loading="lazy"></p>
<p> <strong>Figure 6</strong></p>
<p>So our updated probability distribution now looks like this:</p>
<p class="imgonly"><img src="http://imgur.com/nXENh.png" alt="" loading="lazy"></p>
<p><strong> Figure 7</strong></p>
<p>...except for one thing: probabilities are supposed to add up to 100%, not 21%. Well, since we’ve <em>conditioned</em> on Result A, that means that the 21% probability mass assigned to Result A is now the entirety of our probability mass -- 21% is the new 100%, you might say. So we simply adjust the numbers in such a way that they add up to 100% <em>and the proportions are the same</em>:</p>
<p class="imgonly" style="--aspect-ratio: 0.4676525; max-width: 253px"><img src="http://i.imgur.com/RIeff.png" alt="" loading="lazy"></p>
<p><strong> Figure 8</strong></p>
<p>There! We’ve just performed a Bayesian update. And that’s what it <em>looks like</em>.</p>
<p>If, instead of Result A, we had gotten Result B,</p>
<p class="imgonly" style="--aspect-ratio: 0.9583333; max-width: 460px"><img src="http://2.bp.blogspot.com/_Ig9I_03TGBQ/TAXmNUu1BpI/AAAAAAAAAAM/s9iVIdtmPy0/s1600/figure09.png" alt="Figure 9" loading="lazy"></p>
<p><strong> Figure 9</strong></p>
<p>then our updated probability distribution would have looked like this:</p>
<p class="imgonly"><img src="http://imgur.com/s9Tw5.png" alt="" loading="lazy"></p>
<p><strong> Figure 10</strong></p>
<p>Similarly, for Result C:</p>
<p class="imgonly"><img src="http://imgur.com/9Ikc0.png" alt="" loading="lazy"></p>
<p><strong> Figure 11</strong></p>
<p><em>Bayes’ theorem</em> is the formula that calculates these updated probabilities. Using H to stand for a hypothesis (such as H<sub>1</sub>, H<sub>2</sub> or H<sub>3</sub>), and E a piece of evidence (such as Result A, Result B, or Result C), it says:</p>
<p>P(H|E) = P(H)*P(E|H)/P(E)</p>
<p>In words: to calculate the updated probability P(H|E), take the portion of the prior probability of H that is allocated to E (i.e. the quantity P(H)*P(E|H)), and calculate what fraction this is of the total prior probability of E (i.e. divide it by P(E)).</p>
<p>What I like about this way of visualizing Bayes’ theorem is that it makes the importance of prior probabilities—in particular, the difference between P(H|E) and P(E|H) -- <em>visually obvious</em>. Thus, in the above example, we easily see that even though P(C|H<sub>3</sub>) is high (80%), P(H<sub>3</sub>|C) is much less high (around 51%) -- and once you have assimilated this visualization method, it should be easy to see that even more extreme examples (e.g. with P(E|H) huge and P(H|E) tiny) could be constructed.</p>
<p>Now let’s use this to examine two tricky probability puzzles, the infamous <a href="http://en.wikipedia.org/wiki/Monty_Hall_problem">Monty Hall Problem</a> and Eliezer’s <a href="https://www.greaterwrong.com/posts/Hug2ePykMkmPzSsx6/drawing-two-aces">Drawing Two Aces</a>, and see how it illustrates the correct answers, as well as how one might go wrong.</p>
<h3><strong>The Monty Hall Problem</strong></h3>
<p>The situation is this: you’re a contestant on a game show seeking to win a car. Before you are three doors, one of which contains a car, and the other two of which contain goats. You will make an initial “guess” at which door contains the car—that is, you will select one of the doors, without opening it. At that point, the host will open a goat-containing door from among the two that you did not select. You will then have to decide whether to stick with your original guess and open the door that you originally selected, or switch your guess to the remaining unopened door. The question is whether it is to your advantage to switch—that is, whether the car is more likely to be behind the remaining unopened door than behind the door you originally guessed.</p>
<p>(If you haven’t thought about this problem before, you may want to try to figure it out before continuing...)</p>
<p>The answer is that it <em>is</em> to your advantage to switch—that, in fact, switching <em>doubles</em> the probability of winning the car.</p>
<p>People often find this counterintuitive when they first encounter it—where “people” includes the author of this post. There are two possible doors that could contain the car; why should one of them be more likely to contain it than the other?</p>
<p>As it turns out, while constructing the diagrams for this post, I “rediscovered” the error that led me to incorrectly conclude that there is a <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> chance the car is behind the originally-guessed door and a <span class="frac"><sup>1</sup>⁄<sub>2</sub></span> chance it is behind the remaining door the host didn’t open. I’ll present that error first, and then show how to correct it. Here, then, is the <em>wrong</em> solution:</p>
<p>We start out with a perfectly correct diagram showing the prior probabilities:</p>
<p class="imgonly"><img src="http://imgur.com/aXwYS.png" alt="" loading="lazy"></p>
<p><strong> Figure 12</strong></p>
<p>The possible hypotheses are Car in Door 1, Car in Door 2, and Car in Door 3; before the game starts, there is no reason to believe any of the three doors is more likely than the others to contain the car, and so each of these hypotheses has prior probability <span class="frac"><sup>1</sup>⁄<sub>3</sub></span>.</p>
<p>The game begins with our selection of a door. That itself isn’t <a href="http://wiki.lesswrong.com/wiki/Evidence">evidence</a> about where the car is, of course—we’re assuming we have no particular information about that, other than that it’s behind one of the doors (that’s the whole point of the game!). Once we’ve done that, however, we will then have the opportunity to “run a test” to gain some “experimental data”: the host will perform his task of opening a door that is guaranteed to contain a goat. We’ll represent the result Host Opens Door 1 by a triangle, the result Host Opens Door 2 by a square, and the result Host Opens Door 3 by a pentagon—thus carving up our hypothesis space more finely into possibilities such as “Car in Door 1 and Host Opens Door 2”, “Car in Door 1 and Host Opens Door 3”, etc.:</p>
<p class="imgonly"><img src="http://imgur.com/bIxZr.png" alt="" loading="lazy"></p>
<p> <strong>Figure 13</strong></p>
<p><strong></strong><br>Before we’ve made our initial selection of a door, the host is equally likely to open either of the goat-containing doors. Thus, at the beginning of the game, the probability of each hypothesis of the form “Car in Door X and Host Opens Door Y” has a probability of <span class="frac"><sup>1</sup>⁄<sub>6</sub></span>, as shown. So far, so good; everything is still perfectly correct.</p>
<p>Now we select a door; say we choose Door 2. The host then opens either Door 1 or Door 3, to reveal a goat. Let’s suppose he opens Door 1; our diagram now looks like this:<br><br><br><div class="imgonly"><img src="http://imgur.com/0xMQs.png" alt="" loading="lazy"></div></p>
<p> <strong>Figure 14</strong></p>
<p>But this shows equal probabilities of the car being behind Door 2 and Door 3!</p>
<p class="imgonly"><img src="http://imgur.com/07q9g.png" alt="" loading="lazy"></p>
<p> <strong>Figure 15</strong></p>
<p>Did you catch the mistake?</p>
<p>Here’s the <em>correct</em> version:<br><br><em>As soon as we selected Door 2</em>, our diagram should have looked like this:</p>
<p class="imgonly"><img src="http://imgur.com/tKGgR.png" alt="" loading="lazy"></p>
<p> <strong>Figure 16</strong></p>
<p>With Door 2 selected, the host no longer has the <em>option</em> of opening Door 2; if the car is in Door 1, he <em>must</em> open Door 3, and if the car is in Door 3, he <em>must</em> open Door 1. We thus see that if the car is behind Door 3, the host is twice as <a href="http://wiki.lesswrong.com/wiki/Likelihood_ratio">likely</a> to open Door 1 (namely, 100%) as he is if the car is behind Door 2 (50%); his opening of Door 1 thus constitutes <a href="http://wiki.lesswrong.com/wiki/Amount_of_evidence">some evidence</a> in favor of the hypothesis that the car is behind Door 3. So, when the host opens Door 1, our picture looks as follows:</p>
<p class="imgonly"><img src="http://imgur.com/5U47D.png" alt="" loading="lazy"></p>
<p> <strong>Figure 17</strong></p>
<p>which yields the correct updated probability distribution:</p>
<p class="imgonly"><img src="http://imgur.com/JYay2.png" alt="" loading="lazy"></p>
<p> <strong>Figure 18</strong></p>
<h3>Drawing Two Aces</h3>
<p>Here is the statement of the problem, from <a href="https://www.greaterwrong.com/posts/Hug2ePykMkmPzSsx6/drawing-two-aces">Eliezer’s post</a>:</p>
<blockquote>
<p><br>Suppose I have a deck of four cards: The ace of spades, the ace of hearts, and two others (say, 2C and 2D).<br><br>You draw two cards at random.<br><br>(...)<br><br>Now suppose I ask you “Do you have an ace?”<br><br>You say “Yes.”<br><br>I then say to you: “Choose one of the aces you’re holding at random (so if you have only one, pick that one). Is it the ace of spades?”<br><br>You reply “Yes.”<br><br>What is the probability that you hold two aces?</p>
</blockquote>
<p><br>(Once again, you may want to think about it, if you haven’t already, before continuing...)</p>
<p>Here’s how our picture method answers the question:</p>
<p><br>Since the person holding the cards has at least one ace, the “hypotheses” (possible card combinations) are the five shown below:</p>
<p class="imgonly"><img src="http://imgur.com/a3dxW.png" alt="" loading="lazy"></p>
<p><strong> Figure 19</strong></p>
<p>Each has a prior probability of <span class="frac"><sup>1</sup>⁄<sub>5</sub></span>, since there’s no reason to suppose any of them is more likely than any other. <br><br>The “test” that will be run is selecting an ace at random from the person’s hand, and seeing if it is the ace of spades. The possible results are:</p>
<p class="imgonly" style="--aspect-ratio: 0.7399103; max-width: 330px"><img src="http://imgur.com/XoXVj.png" alt="" loading="lazy"></p>
<p> <strong>Figure 20</strong></p>
<p>Now we run the test, and get the answer “YES”; this puts us in the following situation:</p>
<p class="imgonly"><img src="http://imgur.com/b2oLJ.png" alt="" loading="lazy"></p>
<p> <strong>Figure 21</strong></p>
<p>The total prior probability of this situation (the YES answer) is (1/6)+(1/3)+(1/3) = <span class="frac"><sup>5</sup>⁄<sub>6</sub></span>; thus, since <span class="frac"><sup>1</sup>⁄<sub>6</sub></span> is <span class="frac"><sup>1</sup>⁄<sub>5</sub></span> of <span class="frac"><sup>5</sup>⁄<sub>6</sub></span> (that is, (1/6)/(5/6) = <span class="frac"><sup>1</sup>⁄<sub>5</sub></span>), our updated probability is <span class="frac"><sup>1</sup>⁄<sub>5</sub></span> -- which happens to be the same as the prior probability. (I won’t bother displaying the final post-update picture here.)</p>
<p>What this means is that the test we ran did not provide any additional information about whether the person has both aces beyond simply knowing that they have at least one ace; we might in fact say that the result of the test is <a href="http://wiki.lesswrong.com/wiki/Screening_off">screened off</a> by the answer to the first question (“Do you have an ace?”).</p>
<p><br>On the other hand, if we had simply asked “Do you have the ace of spades?”, the diagram would have looked like this:</p>
<p class="imgonly"><img src="http://imgur.com/CWtH4.png" alt="" loading="lazy"></p>
<p> <strong>Figure 22</strong></p>
<p>which, upon receiving the answer YES, would have become:</p>
<p class="imgonly"><img src="http://imgur.com/oc1YQ.png" alt="" loading="lazy"></p>
<p> <strong>Figure 23</strong></p>
<p>The total probability mass allocated to YES is <span class="frac"><sup>3</sup>⁄<sub>5</sub></span>, and, within that, the specific situation of interest has probability <span class="frac"><sup>1</sup>⁄<sub>5</sub></span>; hence the updated probability would be <span class="frac"><sup>1</sup>⁄<sub>3</sub></span>.</p>
<p>So a YES answer in this experiment, unlike the other, would provide <a href="http://wiki.lesswrong.com/wiki/Evidence">evidence</a> that the hand contains both aces; for if the hand contains both aces, the probability of a YES answer is 100% -- twice as large as it is in the contrary case (50%), giving a <a href="http://wiki.lesswrong.com/wiki/Likelihood_ratio">likelihood ratio</a> of 2:1. By contrast, in the other experiment, the probability of a YES answer is only 50% even in the case where the hand contains both aces.</p>
<p><br>This is what people who try to explain the difference by uttering the opaque phrase “a random selection was involved!” are actually talking about: the difference between</p>
<p class="imgonly"><img src="http://imgur.com/IWwTV.png" alt="" loading="lazy"></p>
<p><strong> Figure 24</strong></p>
<p>and</p>
<p><div class="imgonly"><img src="http://imgur.com/ugqVb.png" alt="" loading="lazy"></div>.</p>
<p><strong> Figure 25</strong></p>
<p>The method explained here is far from the only way of visualizing Bayesian updates, but I feel that it is among the most intuitive.</p>
<p>(<em>I’d like to thank my sister, </em><a href="https://www.lesswrong.com/user/Vive-ut-Vivas/">Vive-ut-Vivas</a><em>, for help with some of the diagrams in this post.)</em></p>
komponisto · Thu, 03 Jun 2010 04:40:21 +0000
How to come up with verbal probabilities by jimmy
https://www.greaterwrong.com/posts/6Bz4TK37T8t5S3AbM/how-to-come-up-with-verbal-probabilities
<p>Unfortunately, we are kludged together, and we can’t just look up our probability estimates in a register somewhere when someone asks us “How sure are you?”.<br><br>The usual heuristic for putting a number on the strength of beliefs is to ask “When you’re this sure about something, what fraction of the time do you expect to be right in the long run?”. This is surely better than just “making up” numbers with no feel for what they mean, but it still has its faults. The big one is that unless you’ve done your calibrating, you may not have a good idea of how often you’d expect to be right. <br><br>I can think of a few different heuristics to use when coming up with probabilities to assign.<br><br>1) Pretend you have to bet on it. Pretend that someone says “I’ll give you ____ odds, which side do you want?”, and figure out what the odds would have to be to make you indifferent to which side you bet on. Consider the question as though you were <a href="http://www.overcomingbias.com/2007/06/uncovering_rati.html"><em>actually going to put money on it</em></a>. If this question is covered on a prediction market, your answer is given to you.</p>
<p>2) Ask yourself how much evidence someone would have to give you before you’re back to 50%. Since we’re trying to update according to Bayes’ law, knowing how much evidence it takes to bring you to 50% tells you the probability you’re implicitly assigning.</p>
<p>For example, pretend someone said something like “I can guess people’s names by their looks”. If he guesses the first name right, and it’s a common name, you’ll probably write it off as a fluke. The second time, you’ll probably think he knew the people or is somehow fooling you, but <a href="https://www.greaterwrong.com/posts/neQ7eXuaXpiYw7SBy/the-least-convenient-possible-world">conditional on that being ruled out</a>, you’d probably say he’s just lucky. By Bayes’ law, this suggests that you put the prior probability of him pulling this stunt at somewhere between 0.1% and 3%, and less than 0.1% prior probability of him having his claimed skill. If it takes 4 correct calls to bring you to equally unsure either way, then that’s about 0.03^4 if they’re common names, or one in a million<sup>1</sup>...<a id="more"></a><br><br>There are a couple of neat things about this trick. One is that it allows you to get an idea of what your subconscious level of certainty is before you ever think of it. You can imagine your immediate reaction to “Why yes, my name is Alex, how did you know?” as well as your carefully deliberated response to the same data (if they’re much different, be wary of <a href="https://www.greaterwrong.com/posts/CqyJzDZWvGhhFJ7dY/belief-in-belief">belief in belief</a>). The other neat thing is that it pulls up the alternate hypotheses that you find more likely, and how likely you find those to be (e.g. “you know these people”).<br><br>3) Map out the typical shape of your probability distributions (i.e. through calibration tests) and then go by how many standard deviations off the mean you are. 
If you’re asked to give the probability that x<C, you can find your one-sigma confidence intervals and then pull up your curve to see what it predicts based on how far out C is<sup>2</sup>.<br><br>4) Draw out your <a href="https://www.greaterwrong.com/posts/LKHJ2Askf92RBbhBp/metauncertainty">metaprobability distribution</a>, and take the mean.<br><br>You may initially have different answers for each question, and in the end you have to decide which to trust when actually placing bets.<br><br>I personally tend to lean towards 1 for intermediate probabilities, and 2 then 4 for very unlikely things. The betting model breaks down as risk gets high (either by high stakes or extreme odds), since we bet to maximize a utility function that is not linear in money.<br><br>What other techniques do you use, and how do you weight them?<br><br><strong>Footnotes:</strong><br><br><strong>1:</strong> A common name covers about 3% of the population, so p(b|!a) = 0.03^4 for 4 consecutive correct guesses, and p(b|a) ~= 1 for the sake of simplicity. Since p(a) is small, (1-p(a)) is approximated as 1.</p>
<p>p(a|b) = p(b|a)*p(a)/p(b) = p(b|a)*p(a)/(p(b|a)*p(a)+p(b|!a)*(1-p(a))) ⇒ approximately 0.5 = p(a)/(p(a)+0.03^4) ⇒ p(a) = 0.03^4 ~= <span class="frac"><sup>1</sup>⁄<sub>1,000,000</sub></span><br><br><strong>2:</strong> The idea came from <a href="https://www.greaterwrong.com/posts/ZEj9ATpv3P22LSmnC/selecting-rationalist-groups">paranoid debating</a>, where Steve Rayhawk assumed a Cauchy distribution. I tried to fit some data I had taken myself, but had insufficient statistics to figure out what the real shape is (if you guys have a bunch more data I could try again). It’s also worth noting that the shape of one’s probability distribution can change significantly from question to question, so this would only apply in some cases.</p>
jimmy · Wed, 29 Apr 2009 08:35:01 +0000