# Beliefs at different timescales

Why is a chess game the opposite of an ideal gas? On short timescales an ideal gas is described by elastic collisions. And a single move in chess can be modeled by a policy network.

The difference is in long timescales: If we simulated elastic collisions for a long time, we’d end up with a complicated distribution over the microstates of the gas. But we can’t run simulations for a long time, so we have to make do with the Boltzmann distribution, which is a lot less accurate.

Similarly, if we rolled out our policy network to get a distribution over chess game outcomes (win/loss/draw), we’d get the distribution of outcomes under self-play. But if we’re observing a game between two players who are better than us, we have access to a more accurate model based on their Elo ratings.

Can we formalize this? Suppose we’re observing a chess game. Our beliefs about the next move are conditional probabilities of the form $P_1(x_{t+1} \mid x_1, \dots, x_t)$, and our beliefs about the next $n$ moves are conditional probabilities of the form $P_n(x_{t+1}, \dots, x_{t+n} \mid x_1, \dots, x_t)$. We can transform beliefs of one type into the other using the operators

$$R(P_1)(x_{t+1}, \dots, x_{t+n} \mid x_1, \dots, x_t) = \prod_{i=1}^{n} P_1(x_{t+i} \mid x_1, \dots, x_{t+i-1})$$

$$S(P_n)(x_{t+1} \mid x_1, \dots, x_t) = \sum_{x_{t+2}, \dots, x_{t+n}} P_n(x_{t+1}, \dots, x_{t+n} \mid x_1, \dots, x_t)$$
If we’re logically omniscient, we’ll have $P_n = R(P_1)$ and $P_1 = S(P_n)$. But in general we will not. A chess game is short enough that $R(P_1)$ is easy to compute, but $S(P_n)$ is too hard because the sum has exponentially many terms. So we can have a long-term model $P_n$ that is more accurate than the rollout $R(P_1)$, and a short-term model $P_1$ that is less accurate than $S(P_n)$. This is a sign that we’re dealing with an intelligence: we can predict outcomes better than actions.
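A minimal sketch of the two operators for a toy two-move game. The next-move model `p1` here is a made-up stand-in for a policy network (its probabilities, the move alphabet, and the function names are all invented for illustration); the point is that the rollout is a product of conditionals while coarse-graining is a sum over exponentially many continuations.

```python
import itertools

# Hypothetical next-move model P1(move | history) over two moves "a" and "b".
def p1(move, history):
    if not history:
        return 0.5                 # uniform on the first move
    return 0.6 if move == history[-1] else 0.4  # slight bias toward repetition

def rollout(history, n):
    """R(P1): distribution over the next n moves, as {tuple: probability}.
    Note the 2**n entries: exponential in the horizon n."""
    dist = {}
    for seq in itertools.product("ab", repeat=n):
        p, h = 1.0, list(history)
        for move in seq:
            p *= p1(move, h)       # multiply conditional probabilities
            h.append(move)
        dist[seq] = p
    return dist

def coarsen(dist_n):
    """S(Pn): marginalize an n-step distribution down to just the next move."""
    marginal = {}
    for seq, p in dist_n.items():
        marginal[seq[0]] = marginal.get(seq[0], 0.0) + p
    return marginal

dist = rollout(["a"], 10)          # 2**10 = 1024 terms
print(coarsen(dist))               # recovers P1: about {'a': 0.6, 'b': 0.4}
```

Coarsening the rollout recovers the one-step model exactly (up to floating-point error), which is the logical-omniscience identity $S(R(P_1)) = P_1$ from the post.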

If instead of a chess game we’re predicting an ideal gas, the relevant timescales are so long that we can’t compute $R(P_1)$ or $S(P_n)$. Our long-term thermodynamic model $P_n$ is less accurate than a simulation $R(P_1)$. This is often a feature of reductionism: complicated things can be reduced to simple things that can be modeled more accurately, although more slowly.

In general, we can have several models at different timescales, and $R$ and $S$ operators connecting all the levels. For example, we might have a short-term model describing the physics of fundamental particles; a medium-term model describing a person’s motor actions; and a long-term model describing what that person accomplishes over the course of a year. The medium-term model will be less accurate than a rollout of the short-term model, and the long-term model may be more accurate than a rollout of the medium-term model if the person is smarter than us.

• (epistemic status: physicist, do simulations for a living)

> Our long-term thermodynamic model $P_n$ is less accurate than a simulation $R(P_1)$

I think it would be fair to say that the Boltzmann distribution and your instantiation of the system contain not more/less but _different kinds of_ information.

Your simulation (assume infinite precision for simplicity) is just one instantiation of a trajectory of your system. There’s nothing stochastic about it; it’s merely an internally-consistent static set of configurations, connected to each other by deterministic equations of motion.

The Boltzmann distribution is [the mathematical limit of] the distribution that you will be sampling from if you evolve your system, under a certain set of conditions (which are generally very good approximations to a very wide variety of physical systems). Boltzmann tells you how likely you would be to encounter a specific configuration in a run that satisfies those conditions.

I suppose you could say that the Boltzmann distribution is less *precise* in the sense that it doesn’t give you a definite Boolean answer as to whether a certain configuration will be visited in a given run. On the other hand, a finite number of runs is necessarily less *accurate* viewed as a sampling of the system’s configurational space.

> we can’t run simulations for a long time, so we have to make do with the Boltzmann distribution

...and on the third hand: usually, even for a simple system like a few-atom molecule, the dimensionality of the configurational space is so enormous that you have to resort to some form of sampling (propagation of equations of motion is one option) in order to calculate your partition function (the normalizing factor in the Boltzmann distribution). Yes, that’s right: the Boltzmann distribution is actually *terribly expensive* to compute for even relatively simple systems!
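To make the cost concrete, here is a toy sketch (a 1-D Ising spin chain, chosen only for illustration; it is not from the original discussion). The exact partition function sums over $2^N$ configurations, so beyond small $N$ one falls back on sampling, here a deliberately naive uniform-sampling estimate.

```python
import itertools
import math
import random

# Toy system: N spins, each ±1, with nearest-neighbor coupling energy.
def energy(spins):
    return -sum(s * t for s, t in zip(spins, spins[1:]))

def exact_partition(n, beta):
    """Exact Z: a sum over all 2**n configurations -- exponential cost."""
    return sum(math.exp(-beta * energy(s))
               for s in itertools.product((-1, 1), repeat=n))

def sampled_partition(n, beta, samples=100_000, seed=0):
    """Naive Monte Carlo estimate: Z ~ 2**n * mean of exp(-beta*E)
    over uniformly drawn configurations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        s = [rng.choice((-1, 1)) for _ in range(n)]
        total += math.exp(-beta * energy(s))
    return 2**n * total / samples

print(exact_partition(10, beta=0.5))    # feasible at n=10 (1024 states)
print(sampled_partition(10, beta=0.5))  # the only option when 2**n is too big
```

Real codes use far better samplers (Metropolis, molecular dynamics) than uniform draws, but the division of labor is the same: exact enumeration is exponential, so the partition function is estimated by sampling.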

Hope these clarifications of your metaphor also help refine the chess part of your dichotomy! :)

• Thanks, I didn’t know that about the partition function.

In the post I was thinking about a situation where we know the microstate to some precision, so the simulation is accurate. I realize this isn’t realistic.

• I’m having trouble following the part about the operators. Could you spell it out in words? What do the two equations represent? Why is one a multiplication and the other a sum?

• Sure: if we can predict the next move in the chess game, then after that move we can predict the next one, and the next, and so on. By iterating, we can predict the whole game. If we have a probability for each next move, we multiply them to get the probability of the game.

Conversely, if we have a probability for an entire game, then we can get a probability for just the next move by adding up the probabilities of all the games that can follow from that move.
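A numeric sketch of that multiply-then-sum relationship, with move probabilities invented purely for illustration:

```python
# Hypothetical probabilities (made up): P(next move = e4) = 0.7;
# after e4, P(reply = e5) = 0.6 and P(reply = c5) = 0.4.
p_e4 = 0.7

# Multiplying conditional next-move probabilities gives probabilities
# of whole continuations:
p_e4_then_e5 = p_e4 * 0.6   # ≈ 0.42
p_e4_then_c5 = p_e4 * 0.4   # ≈ 0.28

# Summing over every continuation that starts with e4 recovers
# the probability of just that next move:
total = p_e4_then_e5 + p_e4_then_c5
assert abs(total - p_e4) < 1e-12
print(total)  # ≈ 0.7
```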

• The issue being, of course, that when we think of predicting the outcome of the chess game based on Elo score, we’re not making any sort of prediction about the very next move (a feat possible only through logical omniscience). A similar thing happens with the gas, where the Boltzmann distribution is not a distribution over histories. I don’t think this is a coincidence.

• To me it seems like for the considerations you bring up in this post, the difference between the ideal gas and the chess game is that we have a near-exact short-timescale model for the ideal gas, but we don’t have a near-exact short-timescale model for the chess game.

If we knew the source code for both of the players in the chess game, we could simulate the game until someone wins and get an accurate prediction of the outcome, better than just using the Elo ratings.

Running the argument through, we would conclude that intelligence is characterized by us not having a reductionist model of it. Which seems not as ridiculous as it first sounds: we ascribed intelligent design to e.g. rain and evolution before we understood how they worked. Also, if we could near-exactly simulate chess players (in our brains, not using computers), I doubt we would see them as very intelligent.

Another disanalogy is that for an ideal gas you want to predict the microstate (which the long-timescale model doesn’t get), but for the chess game you want to predict the macrostate (who wins).

• Yeah, I think the fact that Elo only models the macrostate makes this an imperfect analogy. I think a better analogy would involve a hybrid model, which assigns a probability to a chess game based on whether each move is plausible (using a policy network), and whether the higher-rated player won.

I don’t think the distinction between near-exact and non-exact models is essential here. I bet we could introduce extra entropy into the short-term gas model and the rollout would still be superior to the Boltzmann distribution for predicting the microstate.

• The notation for the sum operator is unclear. I’d advise writing the sum as $\sum_x$ and using an $x$ subscript inside the sum so it’s clearer what is being substituted where.

• The sum isn’t over $x$, though, it’s over all possible tuples of length $n-1$. Any ideas for how to make that more clear?

• I find the current notation fine, but if you want to make it more explicit, you could do

$$\sum_{x_{t+2}} \cdots \sum_{x_{t+n}} P_n(x_{t+1}, x_{t+2}, \dots, x_{t+n} \mid x_1, \dots, x_t)$$

• My initial inclination is to introduce $X_i$ as the space of events on turn $i$, and define $X = X_{t+2} \times \cdots \times X_{t+n}$, and then you can express it as $\sum_{(x_{t+2}, \dots, x_{t+n}) \in X}$.