# [Question] How should we model complex systems?

By “complex”, I mean a system for which it would be too computationally costly to model it from first principles, e.g. the economy or the climate (my field, by the way). Suppose our goal is to predict a system’s future behaviour with the minimum possible error given by some metric (e.g. minimise the mean square error or maximise the likelihood). This seems like something we would want to do in an optimal way, and also something a superintelligence should have a strategy for, so I thought I’d ask here whether anyone has worked on this problem.

I’ve read quite a bit about how we can optimally try to deduce the truth, e.g. apply Bayes’ theorem with a prior set following Ockham’s razor (cf. Solomonoff induction). However, this seems difficult to apply to modelling complex systems, even as an idealisation, because:

1. Since we cannot afford to model the true equations, every member of the set of models available to us is false, so the likelihood and posterior probability of each will typically evaluate to zero given enough observed data. So if we want to use Bayes’ theorem, the probabilities should not mean the probability of each model being true. But it’s not clear to me what they should mean instead: perhaps the probability that each model will give the prediction with the lowest error? But then it’s not clear how to do updating, if the normal likelihoods will typically be zero.

2. It doesn’t seem clear that Ockham’s razor will be a good guide for assigning prior probabilities to our models. Its use seems to be motivated by it working well for deducing fundamental laws of nature. However, for modelling complex systems it seems more reasonable to give more weight to models that incorporate what we understand to be the important processes, and past observations can’t necessarily tell us which processes are important to include, because different processes may become important in future (cf. biological feedbacks that may kick in as the climate warms). This could perhaps be done by having a strategy for deriving approximate, affordable models from the fundamental laws, but is it possible to say anything about how an agent should do this?
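To make point 1 concrete: one candidate reinterpretation is to weight models by recent predictive loss rather than by literal likelihood, as in the multiplicative-weights / “prediction with expert advice” setting from online learning. A minimal sketch (all numbers illustrative, and I’m not claiming this is the right rule):

```python
import math

def update_weights(weights, losses, eta=0.5):
    # Multiplicative-weights update: shrink each model's weight by
    # exp(-eta * loss) and renormalise. Loss-based, so it stays
    # meaningful even when every model is strictly false.
    raw = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(raw)
    return [r / total for r in raw]

# Three (false) models predict the next observation of a series.
weights = [1 / 3, 1 / 3, 1 / 3]
predictions = [1.9, 2.5, 0.0]
observation = 2.0
losses = [(p - observation) ** 2 for p in predictions]
weights = update_weights(weights, losses)
# Combined forecast: weighted average of the models' predictions.
combined = sum(w * p for w, p in zip(weights, predictions))
```

The model with the smallest recent error ends up with the largest weight; none of the weights is “the probability this model is true”.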

I’ve not found anything about rational strategies to approximately model complex systems rather than derive true models. Thank you very much for any thoughts and resources you can share.

• I think it’s an open question whether we can generally model complex systems at all – at least in the sense of being able to make precise predictions about the detailed state of entire complex systems.

But there are still ways to make progress at modeling and predicting aspects of complex systems, e.g. aggregate info, dynamics, possible general states.

The detailed behavior of a macroscopic quantity of individual molecules is complex and impossible to predict in detail at the level of individual molecules, but we can accurately predict some things for some of these systems: the overall temperature, the relative quantities of different types of molecules, etc.

Some potentially complex systems exhibit behavior that is globally, or at some level, ‘simple’ in some way, i.e. relatively static or repetitive, nested, or random. This is where simple mathematical or statistical modeling works best.

Statistical mechanics and chemistry are good examples of this.

The hardest complex systems to model involve, at some level, an interplay of repetitive and random behavior. This often involves ‘structures’ whose individual history affects the global state of the system on long-enough timescales. Sometimes the only way to precisely predict the future of the detailed state of these kinds of systems is to simulate them exactly.

Biology, economics, and climatology are good examples of subjects that study these kinds of systems.

For the most complex systems, often the best we can do is predict the possible or probable presence of kinds or categories of dynamics or patterns of behavior. In essence, we don’t try to model an entire individual complex system as a whole, in detail, but focus on modeling parts of a more general class of those kinds of systems.

This can be thought of as ‘bottom-up’ modeling. Some examples: modeling senescence, bank runs, or regional climate cycles.

I’ve not found anything about rational strategies to approximately model complex systems rather than derive true models.

I interpret “approximately model complex systems” as ‘top-down’ ‘statistical’ modeling – that can be useful regardless, even if it’s wrong, but might be reasonably accurate if the system is relatively ‘simple’. But if the system is complex to the ‘worst’ degree, then we need to “derive true models” for at least some parts of the system and approximately model the global system using something like a ‘hierarchical’ model built from ‘smaller’ models.

In full generality, answering this question demands a complete epistemology and decision theory!

For ‘simple’ complex systems, we may be able to predict their future fairly accurately. For the most complex systems, often we can only wait to discover their future states – in detail – but we may be able to predict some subset of the overall system (in time and ‘space’) in the interim.

• Thanks for your reply. (I repeat my apology from below for apparently not being able to use formatting options in my browser in this.)

“I think it’s an open question whether we can generally model complex systems at all – at least in the sense of being able to make precise predictions about the detailed state of entire complex systems.”

I agree modelling the detailed state is perhaps not possible. However, there are at least some complex systems we can model with substantial positive skill at predicting particular variables, without needing to model all the details, e.g. the weather, for particular variables up to a particular amount of time ahead; and predictions of global mean warming made from the 1980s seem to have validated quite well so far (for decadal averages). So human minds seem to succeed at least sometimes, but without seeming to follow a particular algorithm. Presumably it’s possible to do better, so my question is essentially: what would an algorithm that could do better look like?

I agree that statistical mechanics is one useful set of methods. But, thinking of the area of climate model development that I know something about: statistical averaging of fluid mechanics does form the backbone of modelling the atmosphere and oceans, but adding representations of processes that are missed by that averaging has added a lot of value (e.g. tropical thunderstorms, which are well below the spacing of the numerical grid over which the fluid mechanics equations were averaged). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging. It would be nice to have an algorithm for that, but that’s probably asking for a lot...

“I interpret “approximately model complex systems” as ‘top-down’ ‘statistical’ modeling”

I didn’t mean this to be top-down rather than bottom-up; it could follow whatever modelling strategy is determined to be optimal.

“answering this question demands a complete epistemology and decision theory!”

That’s what I was worried about… (though, is decision theory relevant when we just want to predict a given system and maximise a pre-specified skill metric?)

• (I’m not sure if there are formatting options anymore in the site UI – formatting is (or can be) done via Markdown syntax. In your user settings, there’s an “Activate Markdown Editor” option that you might want to try toggling if you don’t want to use Markdown directly.)

So human minds seem to succeed at least sometimes, but without seeming to follow a particular algorithm. Presumably it’s possible to do better, so my question is essentially: what would an algorithm that could do better look like?

I think ‘algorithm’ is an imprecise term for this discussion. I don’t think there are any algorithms similar to a prototypical example of a computational ‘algorithm’ that could possibly do a better job, in general, than human minds. In the ‘expansive’ sense of ‘algorithm’, an AGI could possibly do better, but we don’t know how to build those yet!

statistical averaging of fluid mechanics does form the backbone of modelling the atmosphere and oceans, but adding representations of processes that are missed by that averaging has added a lot of value (e.g. tropical thunderstorms, which are well below the spacing of the numerical grid over which the fluid mechanics equations were averaged). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging. It would be nice to have an algorithm for that, but that’s probably asking for a lot...

There might be algorithms that could indicate whether, or how likely it is, that a model is ‘missing’ something, but solving that problem generally would require access to the ‘target’ system like we have (i.e. by almost entirely living inside of it). If you think about something like using an (AI) ‘learning’ algorithm, you wouldn’t expect it to be able to learn about aspects of the system that aren’t provided to it as input. But how could we feasibly, or even in principle, provide the Earth’s climate as input, i.e. what would we measure (and how would we do it)?

I interpret “approximately model complex systems” as ‘top-down’ ‘statistical’ modeling

I didn’t mean this to be top-down rather than bottom-up; it could follow whatever modelling strategy is determined to be optimal.

What I was sketching was something like how we currently model complex systems. It can be very helpful to model a system top-down, e.g. statistically, by focusing on relatively simple global attributes. The inputs for fluid mechanics models of the climate are an example of that. Running those models is a mix of top-down and bottom-up. The model details are generated top-down, but studying the dynamics of those models in detail is more bottom-up.

answering this question demands a complete epistemology and decision theory!

That’s what I was worried about… (though, is decision theory relevant when we just want to predict a given system and maximise a pre-specified skill metric?)

Any algorithm is in effect a decision theory. A general algorithm for modeling arbitrary complex systems would effectively make a vast number of decisions. I suspect finding or building a feasible and profitable algorithm like this will also effectively require “a complete epistemology and decision theory”.

We already have a lot of fantastically effective tools for creating top-down models.

But when the best top-down models we can make aren’t good enough, we might need to consider incorporating elements from bottom-up models that aren’t already included in what we’re measuring, and trying to predict, at the top-down level, e.g. including cyclones in a coarse fluid mechanics model. Note that cyclones are a perfect example of what I referred to as:

‘structures’ whose individual history affects the global state of the system.

We need good decision theories to know when to search for more or better bottom-up models. What are we missing? How should we search? (When should we give up?)

The name for ‘algorithms’ (in the expansive sense) that can do what you’re asking is ‘general intelligence’. But we’re still working on understanding them!

Concretely, the fundamental problem with developing a general algorithm to “approximately model complex systems” is acquiring the necessary data to feed the algorithm as input. What’s the minimum amount and type of data that we need to approximately model the Earth’s climate (well enough)? If we don’t already have that data, how can the general algorithm acquire it? In general, only a general intelligence that can act as an agent is capable of doing that (e.g. deciding what new measurements to perform and then actually performing them).

A vague and imprecise sketch of a general algorithm might be:

1. Measure the complex system somehow.

2. Use some ‘learning algorithm’ to generate an “approximate” model.

3. Is the model produced in [2] good enough? Yes? Profit! (And stop executing this algorithm.) No? Continue.

4. Try some combination of different ways to measure the system [1] and different learning algorithms to generate models [2].

Note that steps [1], [2], and [4] are, to varying degrees, bottom-up modeling. [4] though also incorporates a heavy ‘top-down’ perspective, e.g. determining/estimating/guessing what is missing from that perspective.

[1] might involve modeling how well different ‘levels’ of the actual complex system can be modeled. Discovering ‘structures’ in some level is good evidence that additional measurements may be required to model levels above that.

[2] might involve discovering or developing new mathematical or computational theories and algorithms, i.e. info about systems in general.
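The four-step sketch above could be caricatured in code; everything here (the function names, the toy “system”) is a hypothetical stand-in, not a real API:

```python
def model_search(strategies, good_enough):
    # Step 4 is the outer loop: try (measurement, learner) pairs.
    for measure, learn in strategies:
        data = measure()              # step 1: measure the system somehow
        model = learn(data)           # step 2: generate an approximate model
        if good_enough(model, data):  # step 3: good enough? Profit!
            return model
    return None                       # search budget exhausted

# Toy system y = 3x, observed either coarsely (rounded) or finely.
coarse = lambda: [(x, round(3 * x)) for x in (0.1, 0.2, 0.3)]
fine = lambda: [(x, 3 * x) for x in (0.1, 0.2, 0.3)]
fit_slope = lambda d: sum(y for _, y in d) / sum(x for x, _ in d)
ok = lambda m, d: all(abs(m * x - y) < 1e-6 for x, y in d)

# The coarse measurement fails step 3; the finer one succeeds.
model = model_search([(coarse, fit_slope), (fine, fit_slope)], ok)
```

All of the hard problems live inside the stand-ins: choosing what to measure, which learners to try, and what counts as good enough.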

• Thanks again. OK, I’ll try using Markdown...

I think ‘algorithm’ is an imprecise term for this discussion.

Perhaps I used the term imprecisely; I basically meant it in a very general sense of being some process, set of rules, etc. that a computer or other agent could follow to achieve the goal.

We need good decision theories to know when to search for more or better bottom-up models. What are we missing? How should we search? (When should we give up?)

The name for ‘algorithms’ (in the expansive sense) that can do what you’re asking is ‘general intelligence’. But we’re still working on understanding them!

Yes, I see the relevance of decision theories there, and that solving this well would require a lot of what would be needed for AGI. I guess when I originally asked, I was wondering if there might have been some insights people had worked out on the way to that: just any parts of such an algorithm that people have figured out, or that at least would reduce the error of a typical scientist. But maybe that will be another while yet...

I think you’re right that such an algorithm would need to make measurements of the real system, or of systems with properties matching component parts (e.g. a tank of air for climate), and have some way to identify the best measurements to make. I guess determining whether there is some important effect that’s not been accounted for yet would require a certain amount of random experimentation (e.g. for climate, heating up patches of land and tanks of ocean water by a few degrees and seeing what happens to the ecology, just as we might do).

This is not necessarily impractical for something like atmospheric or oceanic modelling, where we can run trustworthy high-resolution models over small spatial regions and get data on how things change with different boundary conditions, so we can tell how the coarse models should behave. Criteria for deciding where and when to run these simulations would then be needed. Regions where errors compared to Earth observations are large, and regions that exhibit relatively large changes with global warming, could be a high priority. I’d have to think whether there could be a sensible systematic way of doing it; I guess it would require an estimate of how much the future-prediction error metric would decrease with the information gained from a particular experiment, which could perhaps be approximated using the sensitivity of the future prediction to the estimated error or uncertainty in predictions of a particular variable. I’d need to think about that more.
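The approximation in the last sentence could be turned into a crude experiment-ranking heuristic: score each candidate experiment or simulation by the sensitivity of the target prediction to a variable, times the current uncertainty in that variable, per unit cost (first-order error propagation, nothing more; all the names and numbers below are invented for illustration):

```python
def experiment_priority(sensitivity, uncertainty, cost):
    # Crude expected reduction in prediction error from an experiment:
    # |d(prediction)/d(variable)| times the current uncertainty in the
    # variable the experiment constrains, per unit cost.
    return abs(sensitivity) * uncertainty / cost

# Hypothetical candidate experiments with made-up numbers.
candidates = {
    "heat soil patch": experiment_priority(0.8, 0.5, 1.0),
    "high-res regional run": experiment_priority(0.3, 0.9, 0.5),
    "ocean tank": experiment_priority(0.1, 0.2, 0.2),
}
best = max(candidates, key=candidates.get)
```

A cheap experiment on a highly uncertain variable can outrank an expensive one on a variable the prediction is more sensitive to, which is roughly the trade-off described above.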

• I was wondering if there might have been some insights people had worked out on the way to that: just any parts of such an algorithm that people have figured out, or that at least would reduce the error of a typical scientist.

There are some pretty general learning algorithms, and even ‘meta-learning’ algorithms in the form of tools that attempt to more or less automatically discover the best model (among some number of possibilities). Machine learning hyper-parameter optimization is an example in that direction.
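As a minimal illustration of that direction: a hyper-parameter search is just a loop over candidate settings that keeps the best-scoring model (real tools, e.g. scikit-learn’s GridSearchCV, add cross-validation and much more; the toy ridge-style estimator below is invented for the example):

```python
def grid_search(train, score, grid):
    # Meta-level loop: try each hyper-parameter value, keep the best.
    return max(grid, key=lambda params: score(train(params)))

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # exactly y = 2x

def train(lam):
    # Ridge-style slope estimate: sum(x*y) / (sum(x*x) + lam).
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + lam)

def score(slope):
    # Negative squared error on the data (higher is better).
    return -sum((slope * x - y) ** 2 for x, y in data)

best_lam = grid_search(train, score, [0.0, 0.1, 1.0])
```

On noise-free data the unregularised fit wins; the meta-level loop doesn’t know that in advance, it just measures.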

My outside view is that a lot of scientists should focus on running better experiments. According to a possibly apocryphal story told by Richard Feynman in a commencement address, one researcher discovered (at least some of) the controls one had to employ to be able to effectively study mice running mazes. Unfortunately, no one else bothered to employ those controls (let alone look for others)! Similarly, a lot of scientific studies or experiments are simply too small to produce even reliable statistical info. There’s probably a lot of such low-hanging fruit available. Though note that this is often a ‘bottom-up’ contribution to ‘modeling’ a larger complex system.

But as you demonstrate in your last two paragraphs, searching for a better ‘ontology’ for your models, e.g. deciding what else to measure, or what to measure instead, is a seemingly open-ended amount of work! There probably isn’t a way to avoid having to think about it more (beyond making other kinds of things that can think for us), at least until you find an ontology that’s ‘good enough’. Regardless, we’re very far from being able to avoid even small amounts of this kind of work.

• [Meta] Curious what browser you are using, so I can figure out whether anyone else has this problem.

• I’m using Chrome 80.0.3987.163 on Mac OS X 10.14.6. But I also tried it in Firefox and didn’t get formatting options. But maybe I’m just doing the wrong thing...

• You do currently have the markdown editor activated, which gets rid of all formatting options, so you not getting them right now wouldn’t surprise me. But you should have gotten them before you activated the markdown editor.

• Yes, I’d selected that because I thought it might get it to work. And now I’ve unselected it, it seems to be working. It’s possible this was a glitch somewhere, or me just being dumb before, I guess.

• Huh, okay. Sorry for the weird experience!

• This is a pretty good question, but there is a general norm on LessWrong to only use the word “rational” when it really can’t be avoided. See:

Only say ‘rational’ when you can’t eliminate the word

• OK, I made some edits. I left the “rational” in the last paragraph because it seemed to me to be the best word to use there.

• A lesson from the last 30 years of AI development: data and computation power are the key factors of improvement.

Thus, IMHO, for obtaining a better model, the most reliable approach is to get more data.

• I’d push back on this pretty strongly: data and computation power, devoid of principled modelling, have historically failed very badly at making forward-looking predictions, especially in economics. That was exactly the topic of the famous Lucas critique. The main problem is causality: brute-force models usually just learn distributions, so they completely fail when distributions shift.

• If a researcher were given 1000X more data and 1000X CPU power, would he switch to a brute-force approach? I don’t see the connection between “data and computation power” and brute-force models.

• A simple toy model: roll a pair of dice many, many times. If we have a sufficiently large amount of data and computational power, then we can brute-force fit the distribution of outcomes: we can count how many times each pair of numbers is rolled, estimate the distribution of outcomes based solely on that, and get a very good fit to the distribution.

By contrast, if we have only a small amount of data/compute, we need to be more efficient in order to get a good estimate of the distribution. We need a prior which accounts for the fact that there are two dice whose outcomes are probably roughly independent, or that the dice are probably roughly symmetric. Leveraging that model structure is more work for the programmer (we need to code that structure into the model, check that it’s correct, and so forth) but it lets us get good results with less data/compute.

So naturally, given more data/compute, people will avoid that extra modelling/programming work and lean towards more brute-force models, especially if they’re just measuring success by fit to their data.

But then the distribution shifts: maybe one of the dice is swapped out for a weighted die. Because our brute-force model has no internal structure, it doesn’t have a way to re-use its information. It doesn’t have a model of “two dice”, it just has a model of “distribution of outcomes”; there’s no notion of some outcomes corresponding to the same face on one of the two dice. But the more principled model does have that internal structure, so it can naturally re-use the still-valid subcomponents of the model when one subcomponent changes.
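The contrast can be made concrete. A sketch of the two fitting strategies (brute-force joint counts versus a structured model that assumes the dice are independent; the independence assumption is exactly what lets the structured model re-use one die’s marginal when the other die is swapped):

```python
from collections import Counter
from itertools import product

FACES = range(1, 7)

def fit_joint(rolls):
    # Brute force: one parameter per (die1, die2) pair, no structure.
    n = len(rolls)
    counts = Counter(rolls)
    return {pair: counts[pair] / n for pair in product(FACES, repeat=2)}

def fit_independent(rolls):
    # Structured: fit each die's marginal separately and multiply.
    # If die 2 is later swapped out, die 1's marginal stays valid.
    n = len(rolls)
    m1 = Counter(a for a, _ in rolls)
    m2 = Counter(b for _, b in rolls)
    return {(a, b): (m1[a] / n) * (m2[b] / n)
            for a, b in product(FACES, repeat=2)}

# Tiny illustrative dataset (a real fit would need many rolls).
rolls = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 1), (2, 2)]
joint = fit_joint(rolls)
indep = fit_independent(rolls)
```

The joint model has 35 free parameters, the independent one only 10, which is why the structured model needs less data, and why its subcomponents survive a distribution shift in just one die.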

Conversely, additional data/compute doesn’t really help us make our models more principled; that’s mainly a modelling/programming problem which currently needs to be handled by humans. To the extent that generalizability is the limiting factor on the usefulness of models, additional data/compute alone doesn’t help much. Indeed, despite the flagship applications in vision and language, most of today’s brute-force-ish deep learning models generalize very poorly.

• This would make a good answer.

• A lot of AI development has been in relatively ‘toy’ domains – compared to modeling the Earth’s climate!

Sometimes what is needed, beyond just more data (of the same type), is a different type of data.

• It is rarely too difficult to specify the true model (or a space of models containing the true model). What’s hard is updating on less-than-fully-informative evidence or, in some cases, even computing what the true model predicts at all (i.e. likelihoods). So when we say it is “too costly to model from first principles”, we should keep in mind that we don’t mean the true model space can’t even be written down efficiently. In particular, this means that “every member of the set of models available to us is false” need not hold. Similarly, Bayesian probability and Ockham’s razor and whatnot can still apply, but we need efficient approximations.

(Side note: “different processes may become important in future” is not actually a problem for Ockham’s razor per se. That’s a problem for causal models, and Bayesian probability + Ockham’s razor are quite capable of learning causal models.)

(Another side note: likelihoods are never actually zero, they’re just very small. But likelihoods are very small for any large amount of data anyway, so there’s nothing unusual about that; a model space which doesn’t contain the true model isn’t really a problem from that perspective.)

If we want to attack these sorts of problems rigorously from first principles, then the central challenge is to find rigorous approximations of the true underlying models. The main field I know of which studies this sort of problem directly is statistical mechanics, and a number of reasonably-general-purpose tools exist in that field which could potentially be applied in other areas (e.g. this). Actually developing those applications, however, is an area of active research.

That said… when I look at the history of failure of “statistical”, non-first-principles models in various fields (especially economics), it looks like they mainly fail because they don’t handle causality properly. That makes sense: the theory of causality is a relatively recent development, so of course 20th-century stats people built models which failed to handle it. Armed with modern tools, it’s entirely plausible that we can handle causality well without having to ground everything in first principles.

• Thanks for your detailed reply. (And sorry I couldn’t format the below well; I don’t seem to get any formatting options in my browser.)

“It is rarely too difficult to specify the true model... this means that “every member of the set of models available to us is false” need not hold”

I agree we could find a true model to explain the economy, climate etc. (presumably the theory of everything in physics). But we don’t have the computational power to make predictions of such systems with that model, so my question is about how we should make predictions when the true model is not practically applicable. By “the set of models available to us”, I meant the models we could actually afford to make predictions with. If the true model is not in that set, then it seems to me that all of these models must be false.

‘“different processes may become important in future” is not actually a problem for Ockham’s razor per se. That’s a problem for causal models’

To take the climate example, say scientists had figured out that there is a biological feedback that kicks in once global warming has gone past 2C (e.g. bacteria become more efficient at decomposing soil and releasing CO2). Suppose you have one model that includes a representation of that feedback (e.g. as a subprocess) and one that does not but is equivalent in every other way (e.g. is coded like the first model but lacks the subprocess). Then isn’t the second model simpler according to metrics like minimum description length, so that it would be weighted higher if we penalised models using such metrics? But this seems the wrong thing to do, if we think the first model is more likely to give a good prediction.

Now, the thought that occurred to me when writing that is that the data the scientists used to deduce the existence of the feedback ought to be accounted for by the models that are used, and this would give low posterior weight to models that don’t include the feedback. But doing this in practice seems hard. Also, it’s not clear to me whether there would be a way to tell between models that represent the process but don’t connect it properly to predicting the climate, e.g. they have a subprocess that says more CO2 is produced by bacteria at warming higher than 2C, but then don’t actually add this CO2 to the atmosphere, or something.

“likelihoods are never actually zero, they’re just very small”

If our models were deterministic, then if they were not true, wouldn’t it be impossible for them to produce the observed data exactly, so that the likelihood of the data given any of those models would be zero? (Unless there was more than one process that could give rise to the same data, which seems unlikely in practice.) Now, if we make the models probabilistic and try to design them such that there is a non-zero chance that the data would be a possible sample from the model, then the likelihood can be non-zero. But it doesn’t seem necessary to do this: models that are false can still give predictions that are useful for decision-making. Also, it’s not clear if we could make a probabilistic model that would have non-zero likelihoods for something as complex as the climate that we could run on our available computers (and that isn’t something obviously of low value for prediction, like just giving probability 1/N to each of N days of observed data). So it still seems like it would be valuable to have a principled way of predicting using models that give a zero likelihood of the data.
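For concreteness, the probabilistic wrapper I mention above is usually done by putting an explicit error distribution around the deterministic prediction, e.g. Gaussian with some hand-picked sigma; whether anything like this is adequate for a climate model is exactly what’s unclear to me, but as a sketch:

```python
import math

def log_likelihood(predictions, observations, sigma=1.0):
    # Gaussian observation error around a deterministic model's output:
    # the likelihood of real data becomes small but never exactly zero.
    # sigma is a free parameter standing in for model + measurement error.
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (obs - pred) ** 2 / (2 * sigma ** 2)
        for pred, obs in zip(predictions, observations)
    )

obs = [1.0, 2.1, 2.9]
model_a = [1.0, 2.0, 3.0]  # false, but close
model_b = [0.0, 0.0, 0.0]  # false, and badly wrong
ll_a = log_likelihood(model_a, obs)
ll_b = log_likelihood(model_b, obs)
```

Both log-likelihoods are finite, and the closer false model scores higher, so Bayesian comparison between false models still does something sensible.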

“the central challenge is to find rigorous approximations of the true underlying models. The main field I know of which studies this sort of problem directly is statistical mechanics, and a number of reasonably-general-purpose tools exist in that field which could potentially be applied in other areas (e.g. this).”

Yes, I agree. Thanks for the link: it looks very relevant and I’ll check it out. Edit: I’ll just add, echoing part of my reply to Kenny’s answer, that whilst statistical averaging has got human modellers a certain distance, adding representations of processes whose effects get missed by the averaging seems to add a lot of value (e.g. tropical thunderstorms in the case of climate). So there seems to be something additional to averaging that can be used, to do with coming up with simplified models of processes you can see are missed out by the averaging.

On causality: whilst of course correcting this is desirable, if the models we can afford to compute with can’t reproduce the data, then presumably they are also not reproducing the correct causal graph exactly? And any causal graph we could compute with will not be able to reproduce the data? (Else it would seem that a causal graph could somehow hugely compress the true equations without information loss; great if so!)

• Side note: one topic I’ve been reading about recently which is directly relevant to some of your examples (e.g. thunderstorms) is multiscale modelling. You might find it interesting.

• Thanks, yes, this is very relevant to thinking about climate modelling, with the dominant paradigm being that we can separately model phenomena above and below the resolved scale. There’s an ongoing debate, though, about whether a different approach would work better, and it gets tricky when the resolved scale gets close to the size of important types of weather system.

• To take the climate example, say scientists had figured out that there is a biological feedback that kicks in once global warming has gone past 2C (e.g. bacteria become more efficient at decomposing soil and releasing CO2). Suppose you have one model that includes a representation of that feedback (e.g. as a subprocess) and one that does not but is equivalent in every other way (e.g. is coded like the first model but lacks the subprocess). Then isn’t the second model simpler according to metrics like minimum description length, so that it would be weighted higher if we penalised models using such metrics? But this seems the wrong thing to do, if we think the first model is more likely to give a good prediction.

The trick here is that the data on which the model is trained/fit has to include whatever data the scientists used to learn about that feedback loop in the first place. As long as that data is included, the model which accounts for it will have lower minimum description length. (This fits in with a general theme: the minimum-complexity model is simple and general-purpose; the details are learned from the data.)

Now, the thought that occurred to me when writing that is that the data the scientists used to deduce the existence of the feedback ought to be accounted for by the models that are used, and this would give low posterior weight to models that don’t include the feedback. But doing this in practice seems hard.

… I’m responding as I read. Yup, exactly. As the Bayesians say, we do need to account for all our prior information if we want reliably good results. In practice, this is “hard” in the sense of “it requires significantly more complicated programming”, but not in the sense of “it increases the asymptotic computational complexity”. The programming is more complicated mainly because the code needs to accept several qualitatively different kinds of data, and custom code is likely needed for hooking up each of them. But that’s not a fundamental barrier; it’s still the same computational challenges which make the approach impractical.

it’s not clear to me whether there would be a way to tell between models that represent the process but don’t connect it properly to predicting the climate...

Again, we need to include whatever data allowed scientists to connect it to the climate in the first place. (In some cases this is just fundamental physics, in which case it’s already in the model.)

If our models were deterministic, then if they were not true, wouldn’t it be impossible for them to produce the observed data exactly, so that the likelihood of the data given any of those models would be zero? (Unless there was more than one process that could give rise to the same data, which seems unlikely in practice.)

Picture a deterministic model which uses fundamental physics, and models the joint distribution of position and momentum of every atom comprising the Earth. The unknown in this model is the initial conditions: the initial position and momentum of every particle (also particle identity, i.e. which element/isotope each is, but we’ll ignore that). Now, imagine how many of the possible initial conditions are compatible with any particular high-level data we observe. It’s a massive number!

Point is: the de­ter­minis­tic part of a model of a fun­da­men­tal phys­i­cal model is the dy­nam­ics; the ini­tial con­di­tions are still gen­er­ally un­known. Con­cep­tu­ally, when we fit the data, we’re mostly look­ing for ini­tial con­di­tions which match. So zero like­li­hoods aren’t re­ally an is­sue; the is­sue is com­put­ing with a joint dis­tri­bu­tion over po­si­tion and mo­men­tum of so many par­ti­cles. That’s what statis­ti­cal me­chan­ics is for.

whilst statis­ti­cal av­er­ag­ing has got hu­man mod­el­lers a cer­tain dis­tance, adding rep­re­sen­ta­tions of pro­cesses whose effects get missed by the av­er­ag­ing seems to add a lot of value

The cor­re­spond­ing prob­lem in statis­ti­cal me­chan­ics is to iden­tify the “state vari­ables”—the low-level vari­ables whose av­er­ages cor­re­spond to macro­scopic ob­serv­ables. For in­stance, the ideal gas law uses den­sity, ki­netic en­ergy, and force on con­tainer sur­faces (whose macro­scopic av­er­ages cor­re­spond to den­sity, tem­per­a­ture, and pres­sure). Fluid flow, rather than av­er­ag­ing over the whole sys­tem, uses den­sity and par­ti­cle ve­loc­ity within each lit­tle cell of space.

The point: if an effect is “missed by av­er­ag­ing”, that’s usu­ally not in­her­ent to av­er­ag­ing as a tech­nique. The prob­lem is that peo­ple av­er­age over poorly-cho­sen fea­tures.

Jaynes ar­gued that the key to choos­ing high-level fea­tures is re­pro­ducibil­ity: what high-level vari­ables do ex­per­i­menters need to con­trol in or­der to get a con­sis­tent re­sult dis­tri­bu­tion? If we con­sis­tently get the same re­sults with­out hold­ing X con­stant (where X in­cludes e.g. ini­tial con­di­tions of ev­ery par­ti­cle), then ap­par­ently X isn’t ac­tu­ally rele­vant to the re­sult, so we can av­er­age out X. Also note that there’s some de­grees of free­dom in what “re­sults” we’re in­ter­ested in. For in­stance, tur­bu­lence has macro­scopic be­hav­ior which de­pends on low-level ini­tial con­di­tions, but the long-term time av­er­age of forces from a tur­bu­lent flow usu­ally doesn’t de­pend on low-level ini­tial con­di­tions—and for en­g­ineer­ing pur­poses, it’s of­ten that time av­er­age which we ac­tu­ally care about.

if the mod­els we can af­ford to com­pute with can’t re­pro­duce the data, then pre­sum­ably they are also not re­pro­duc­ing the cor­rect causal graph ex­actly? And any causal graph we could com­pute with will not be able to re­pro­duce the data?

Once we move away from stat mech and ap­prox­i­ma­tions of low-level mod­els, yes, this be­comes a prob­lem. How­ever, two coun­ter­points. First, this is the sort of prob­lem where the out­put says “well, the best model is one with like a gazillion edges, and there’s a bunch that all fit about equally well, so we have no idea what will hap­pen go­ing for­ward”. That’s un­satis­fy­ing, but at least it’s not wrong. Se­cond, if we do get that sort of re­sult, then it prob­a­bly just isn’t pos­si­ble to do bet­ter with the high-level vari­ables cho­sen. Go­ing back to re­pro­ducibil­ity and se­lec­tion of high-level vari­ables: if we’ve omit­ted some high-level vari­able which re­ally does im­pact the re­sults we’re in­ter­ested in, then “we have no idea what will hap­pen go­ing for­ward” re­ally is the right an­swer.

• Thanks again.

I think I need to think more about the like­li­hood is­sue. I still feel like we might be think­ing about differ­ent things—when you say “a de­ter­minis­tic model which uses fun­da­men­tal physics”, this would not be in the set of mod­els that we could af­ford to run to make pre­dic­tions for com­plex sys­tems. For the mod­els we could af­ford to run, it seems to me that no choice of ini­tial con­di­tions would lead them to match the data we ob­serve, ex­cept by ex­treme co­in­ci­dence (analo­gous to a sim­ple polyno­mial just hap­pen­ing to pass through all the dat­a­points pro­duced by a much more com­plex func­tion).

I’ve gone through Jaynes’ pa­per now from the link you gave. His point about de­cid­ing what macro­scopic vari­ables mat­ter is well-made. But you still need a model of how the macro­scopic vari­ables you ob­serve re­late to the ones you want to pre­dict. In mod­el­ling at­mo­spheric pro­cesses, sim­ple spa­tial av­er­ag­ing of the fluid dy­nam­ics equa­tions over re­solved spa­tial scales gets you some way, but then chang­ing the form of the func­tion re­lat­ing the fu­ture to pre­sent states (“adding rep­re­sen­ta­tions of pro­cesses” as I put it be­fore) adds ad­di­tional skill. And Jaynes’ pa­per doesn’t seem to say how you should choose this func­tion.

• For the mod­els we could af­ford to run, it seems to me that no choice of ini­tial con­di­tions would lead them to match the data we ob­serve, ex­cept by ex­treme co­in­ci­dence (analo­gous to a sim­ple polyno­mial just hap­pen­ing to pass through all the dat­a­points pro­duced by a much more com­plex func­tion).

Ok, let’s talk about com­put­ing with er­ror bars, be­cause it sounds like that’s what’s miss­ing from what you’re pic­tur­ing.

The usual starting point is linear error: we assume that errors are small enough for a linear approximation to be valid. (After this we'll talk about how to remove that assumption.) We have some multivariate function f(x). Imagine that x is the full state of our simulation at some timestep, and f calculates the state at the next timestep. The value x̂ of x in our program is really just an estimate of the "true" value x; it has some error Δx. As a result, the value f(x̂) of f in our program also has some error Δf. Assuming the error is small enough for linear approximation to hold, we have:

Δf ≈ (df/dx) Δx

where df/dx is the Jacobian, i.e. the matrix of derivatives of every entry of f with respect to every entry of x.

Next, assume that Δx has covariance matrix Σ_x, and we want to compute the covariance matrix Σ_f of Δf. We have a linear relationship between Δf and Δx, so we use the usual formula for a linear transformation of covariance:

Σ_f = (df/dx) Σ_x (df/dx)^T

Now imagine iterating this at every timestep: we compute the timestep itself, then differentiate that timestep, and matrix-multiply our previous uncertainty on both sides by the derivative matrix to get the new uncertainty:

Σ_{t+1} = J_t Σ_t J_t^T,  where J_t is df/dx evaluated at the state at timestep t

Now, a few key things to note:

• For most sys­tems of in­ter­est, that un­cer­tainty is go­ing to grow over time, usu­ally ex­po­nen­tially. That’s cor­rect: in a chaotic sys­tem, if the ini­tial con­di­tions are un­cer­tain, then of course we should be­come more and more un­cer­tain about the sys­tem’s state over time.

• Those formulas only propagate uncertainty in the previous state to uncertainty in the next state. Really, there's also new uncertainty introduced at each timestep, e.g. from error in f itself (i.e. due to averaging) or from whatever's driving the system. Typically, such errors are introduced as an additive term: we compute the covariance in Δf introduced by each source of error, and add it to the propagated covariance matrix at each timestep.

• Actually storing the whole covariance matrix would take O(n²) space if x has n elements, which is completely impractical when x is the whole state of a finite element simulation. We make this practical the same way we make all matrix operations practical in numerical computing: exploit sparsity/structure. This is application-specific, but usually the covariance can be well-approximated as the sum of sparse "local" covariances and low-rank "global" covariances.

• Likewise with the update: we don't actually want to compute the n-by-n derivative matrix and then matrix-multiply with the covariance. Most backpropagation libraries expose the derivative as a linear operator rather than an explicit matrix, and we want to use it that way. Again, specifics will vary, depending on the structure of f and of the (approximated) covariance matrix.

• In many applications, we have data coming in over time. That data reduces our uncertainty every time it comes in; at that point, we effectively have a Kalman filter. If enough data is available, the uncertainty remains small enough for the linear approximation to continue to hold, and the whole thing works great.

• If the un­cer­tainty does be­come too large for lin­ear ap­prox­i­ma­tion, then we need to re­sort to other meth­ods for rep­re­sent­ing un­cer­tainty, rather than just a co­var­i­ance ma­trix. Par­ti­cle filters are one sim­ple-but-effec­tive fal­lback, and can be com­bined with lin­ear un­cer­tainty as well.
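As a minimal sketch of the propagation scheme above (a toy two-variable system and a numerical Jacobian, purely for illustration; a real implementation would exploit sparsity and use the derivative as a linear operator, per the notes above):

```python
import numpy as np

def step(x):
    # Toy nonlinear dynamics standing in for one simulation timestep
    # (a hypothetical 2-variable system, not a real model).
    return np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])

def jacobian(f, x, eps=1e-6):
    # Numerical Jacobian df/dx via central differences.
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([1.0, 0.5])     # point estimate of the state
cov = 0.01 * np.eye(2)       # initial covariance of the error Δx

for _ in range(100):
    J = jacobian(step, x)
    x = step(x)              # propagate the estimate
    cov = J @ cov @ J.T      # propagate the covariance: Σ <- J Σ J^T
    cov += 1e-6 * np.eye(2)  # additive process noise introduced each step

print(np.sqrt(np.diag(cov)))  # standard deviation of each state variable
```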

In gen­eral, if this sounds in­ter­est­ing and you want to know more, it’s cov­ered in a lot of differ­ent con­texts. I first saw most of it in an au­tonomous ve­hi­cles course; be­sides robotics, it’s also heav­ily used in eco­nomic mod­els, and some­times sys­tems/​con­trol the­ory courses will fo­cus on this sort of stuff.

Is this start­ing to sound like a model for which the ob­served data would have nonzero prob­a­bil­ity?

• Do you mean you'd be adding the probability distribution with that covariance matrix on top of the mean prediction from f, to make it a probabilistic prediction? I was talking about deterministic predictions before, though my text doesn't make that clear. For probabilistic models, yes, adding an uncertainty distribution may result in non-zero likelihoods. But if we know the true dynamics are deterministic (pretend there are no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.

Likelihoods also don't obviously (to me) make very good measures of model quality for chaotic systems. In these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor, and it's the attractor geometry (conditional on boundary conditions) that we'd really want to assess. Perhaps the true model would then have a higher likelihood than every other model, but that's not obvious to me, and it's not obvious that there isn't a better metric for leading to good inferences when we don't have the true model.

Ba­si­cally the logic that says to use Bayes for de­duc­ing the truth does not seem to carry over in an ob­vi­ous way (to me) to the case when we want to pre­dict but can’t use the true model.

• But if we know the true dy­nam­ics are de­ter­minis­tic (pre­tend there’s no quan­tum effects, which are largely ir­rele­vant for our pre­dic­tion er­rors for sys­tems in the clas­si­cal physics do­main), then we still know the model is not true, and so it seems difficult to in­ter­pret p if we were to do Bayesian up­dat­ing.

Ah, that’s where we need to ap­ply more Bayes. The un­der­ly­ing sys­tem may be de­ter­minis­tic at the macro­scopic level, but that does not mean we have perfect knowl­edge of all the things which effect the sys­tem’s tra­jec­tory. Most of the un­cer­tainty in e.g. a weather model would not be quan­tum noise, it would be things like ini­tial con­di­tions, mea­sure­ment noise (e.g. how close is this mea­sure­ment to the ac­tual av­er­age over this whole vol­ume?), ap­prox­i­ma­tion er­rors (e.g. from dis­cretiza­tion of the dy­nam­ics), driv­ing con­di­tions (are we ac­count­ing for small vari­a­tions in sun­light or tidal forces?), etc. The true dy­nam­ics may be de­ter­minis­tic, but that doesn’t mean that our es­ti­mates of all the things which go into those dy­nam­ics have no un­cer­tainty. If the in­puts have un­cer­tainty (which of course they do), then the out­puts also have un­cer­tainty.

The main point of prob­a­bil­is­tic mod­els is not to han­dle “ran­dom” be­hav­ior in the en­vi­ron­ment, it’s to quan­tify un­cer­tainty re­sult­ing from our own (lack of) knowl­edge of the sys­tem’s in­puts/​pa­ram­e­ters.

Like­li­hoods are also not ob­vi­ously (to me) very good mea­sures of model qual­ity for chaotic sys­tems, ei­ther—in these cases we know that even if we had the true model, its pre­dic­tions would di­verge from re­al­ity due to er­rors in the ini­tial con­di­tion es­ti­mates, but it would trace out the cor­rect at­trac­tor...

Yeah, you’re point­ing to an im­por­tant is­sue here, al­though it’s not ac­tu­ally like­li­hoods which are the prob­lem—it’s point es­ti­mates. In par­tic­u­lar, that makes lin­ear ap­prox­i­ma­tions a po­ten­tial is­sue, since they’re im­plic­itly ap­prox­i­ma­tions around a point es­ti­mate. Some­thing like a par­ti­cle filter will do a much bet­ter job than a Kal­man filter at trac­ing out an at­trac­tor, since it ac­counts for non­lin­ear­ity much bet­ter.

Any­way, rea­son­ing with like­li­hoods and pos­te­rior dis­tri­bu­tions re­mains valid re­gard­less of whether we’re us­ing point es­ti­mates. When the sys­tem is chaotic but has an at­trac­tor, the pos­te­rior prob­a­bil­ity of the sys­tem state will end up smeared pretty evenly over the whole at­trac­tor. (Although with enough fine-grained data, we can keep track of roughly where on the at­trac­tor the sys­tem is at each time, which is why Kal­man-type mod­els work well in that case.)
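A minimal particle-filter sketch of tracking a chaotic system from noisy observations (using the logistic map as a stand-in system, with made-up noise levels; not a serious filter implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):
    # Toy chaotic dynamics (the logistic map), standing in for the true model.
    return 3.9 * x * (1.0 - x)

n = 2000
obs_noise = 0.05
true_x = 0.37                          # true initial condition, unknown to the filter
particles = rng.uniform(0.0, 1.0, n)   # prior: no idea where on [0, 1] we start

for _ in range(30):
    true_x = step(true_x)
    obs = true_x + rng.normal(0.0, obs_noise)                # noisy measurement
    particles = step(particles) + rng.normal(0.0, 1e-3, n)   # propagate + jitter
    # Reweight by the likelihood of the observation under each particle...
    w = np.exp(-0.5 * ((obs - particles) / obs_noise) ** 2)
    w /= w.sum()
    # ...and resample, so high-likelihood particles are duplicated.
    particles = rng.choice(particles, size=n, p=w)

print(abs(particles.mean() - true_x))  # posterior mean tracks the true state
```

Without the observation updates, the particle cloud would smear out over the attractor, which is exactly the "posterior smeared evenly over the attractor" picture above.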

• So when we say it is “too costly to model from first prin­ci­ples”, we should keep in mind that we don’t mean the true model space can’t even be writ­ten down effi­ciently.

I’m con­fused. Are you re­ally claiming that mod­el­ing the Earth’s cli­mate can be writ­ten down “effi­ciently”? What ex­actly do you mean by ‘effi­ciently’? What would a sketch of an effi­cient de­scrip­tion of the “true model space” for the Earth’s cli­mate be?

• Ex­treme an­swer: just point AIXI at wikipe­dia. That’s a bit tongue-in-cheek, but it illus­trates the con­cepts well. The ac­tual mod­els (i.e. AIXI) can be very gen­eral and com­pact; rather than AIXI, a speci­fi­ca­tion of low-level physics would be a more re­al­is­tic model to use for cli­mate. Most of the com­plex­ity of the sys­tem is then learned from data—i.e. his­tor­i­cal weather data, a topo map of the Earth, com­po­si­tion of air/​soil/​wa­ter sam­ples, etc. An ex­act Bayesian up­date of a low-level phys­i­cal model on all that data should be quite suffi­cient to get a solid cli­mate model; it wouldn’t even take an un­re­al­is­tic amount of data (data already available on­line would likely suffice). The prob­lem is that we can’t effi­ciently com­pute that up­date, or effi­ciently rep­re­sent the up­dated model—we’re talk­ing about a joint dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle com­pris­ing the Earth, and that’s even be­fore we ac­count for quan­tum. But the prior dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle we can rep­re­sent eas­ily—just use some­thing max­en­tropic, and the data will be enough to figure out the (rele­vant parts of the) rest.

So to an­swer your spe­cific ques­tions:

• the “true model space” is just low-level physics

• by “effi­ciently”, I mean the code would be writable by a hu­man and the “train­ing” data would eas­ily fit on your hard drive

• Can we re­duce the is­sue of “we can’t effi­ciently com­pute that up­date” by adding sen­sors?

What if we could get more data? If facing this type of difficulty, I would ask that question first.

• Yeah, the usual mechanism by which more data re­duces com­pu­ta­tional difficulty is by di­rectly iden­ti­fy­ing the val­ues some pre­vi­ously-la­tent vari­ables. If we know the value of a vari­able pre­cisely, then that’s easy to rep­re­sent; the difficult-to-rep­re­sent dis­tri­bu­tions are those where there’s a bunch of vari­ables whose un­cer­tainty is large and tightly cou­pled.

• No, he’s refer­ring to some­thing like perform­ing a Bayesian up­date over all com­putable hy­pothe­ses – that’s in­com­putable (i.e. even in the­ory). It’s in­finitely be­yond the ca­pa­bil­ities of even a quan­tum com­puter the size of the uni­verse.

Think of it as a kind of (the­o­ret­i­cal) ‘up­per bound’ on the prob­lem. None of the ac­tual com­putable (i.e. on real-world com­put­ers built by hu­mans) ap­prox­i­ma­tions to AIXI are very good in prac­tice.

• The AIXI thing was a joke; a Bayesian up­date on low-level physics with un­known ini­tial con­di­tions would be su­per­ex­po­nen­tially slow, but it cer­tainly isn’t un­com­putable. And the dis­tinc­tion does mat­ter—un­com­putabil­ity usu­ally in­di­cates fun­da­men­tal bar­ri­ers even to ap­prox­i­ma­tion, whereas su­per­ex­po­nen­tial slow­ness does not (at least in this case).

• That’s what I thought you might have meant.

In a sense, ex­ist­ing cli­mate mod­els are already “low-level physics” ex­cept that “low-level” means coarse ag­gre­gates of cli­mate/​weather mea­sure­ments that are so big that they don’t in­clude trop­i­cal cy­clones! And, IIRC, those mod­els are so ex­pen­sive to com­pute that they can only be com­puted on su­per­com­put­ers!

But I’m still con­fused as to whether you’re claiming that some­one could im­ple­ment AIXI and feed it all the data you men­tioned.

the prior dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle we can rep­re­sent eas­ily—just use some­thing max­en­tropic, and the data will be enough to figure out the (rele­vant parts of the) rest.

You seem to be claiming that “Wikipe­dia” (or all of the sci­en­tific data ever mea­sured) would be enough to gen­er­ate “the prior dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle” and that this data would eas­ily fit on a hard drive. Or are you claiming that such an effi­cient rep­re­sen­ta­tion ex­ists in the­ory? I’m still skep­ti­cal of the lat­ter.

The prob­lem is that we can’t effi­ciently com­pute that up­date, or effi­ciently rep­re­sent the up­dated model—we’re talk­ing about a joint dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle com­pris­ing the Earth, and that’s even be­fore we ac­count for quan­tum.

This makes me believe that you're referring to some kind of theoretical algorithm. I understood the asker to want something (efficiently) computable, at least relative to actual current climate models (i.e. something requiring no more than supercomputers to use).

• But I’m still con­fused as to whether you’re claiming that some­one could im­ple­ment AIXI and feed it all the data you men­tioned.

That was a joke, but com­putable ap­prox­i­ma­tions of AIXI can cer­tainly be im­ple­mented. For in­stance, a log­i­cal in­duc­tor run on all that data would be con­cep­tu­ally similar for our pur­poses.

You seem to be claiming that “Wikipe­dia” (or all of the sci­en­tific data ever mea­sured) would be enough to gen­er­ate “the prior dis­tri­bu­tion over po­si­tions and mo­menta of ev­ery par­ti­cle” and that this data would eas­ily fit on a hard drive.

No, wikipe­dia or a bunch of sci­en­tific data (much less than all the sci­en­tific data ever mea­sured), would be enough data to train a solid cli­mate model from a sim­ple prior over par­ti­cle dis­tri­bu­tions and mo­menta. It would definitely not be enough to learn the po­si­tion and mo­men­tum of ev­ery par­ti­cle; a key point of stat mech is that we do not need to learn the po­si­tion and mo­men­tum of ev­ery par­ti­cle in or­der to make macro­scopic pre­dic­tions. A sim­ple max­en­tropic prior over micro­scopic states plus a (rel­a­tively) small amount of macro­scopic data is enough to make macro­scopic pre­dic­tions.
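A toy version of the "maxentropic prior plus macroscopic data" move (with hypothetical discrete energy levels, purely for illustration): among all distributions over microstates, pick the maximum-entropy one consistent with a measured macroscopic average. The result is the familiar Boltzmann form p_i ∝ exp(-b·E_i), and we can solve for b by bisection:

```python
import numpy as np

# Hypothetical discrete energy levels of the microstates.
E = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
target_mean_E = 1.3   # the "macroscopic data": a measured average energy

def mean_energy(b):
    # Mean energy under the Boltzmann distribution p_i ∝ exp(-b * E_i).
    p = np.exp(-b * E)
    p /= p.sum()
    return p @ E

# mean_energy is decreasing in b, so bisect to match the measured average.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_energy(mid) > target_mean_E:
        lo = mid
    else:
        hi = mid

b = 0.5 * (lo + hi)
p = np.exp(-b * E)
p /= p.sum()
print(p, p @ E)   # maxent distribution over microstates matching the constraint
```

The real stat-mech version does this over astronomically many degrees of freedom, which is exactly why the update can't be computed naively; but the prior itself is this simple.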

This makes me be­lieve that you’re refer­ring to some kind of the­o­ret­i­cal al­gorithm.

The code it­self need not be the­o­ret­i­cal, but it would definitely be su­per­ex­po­nen­tially slow to run. Mak­ing it effi­cient is where stat mech, mul­ti­s­cale mod­el­ling, etc come in. The point I want to make is that the sys­tem’s “com­plex­ity” is not a fun­da­men­tal bar­rier re­quiring fun­da­men­tally differ­ent epistemic prin­ci­ples.

• … wikipe­dia or a bunch of sci­en­tific data (much less than all the sci­en­tific data ever mea­sured), would be enough data to train a solid cli­mate model from a sim­ple prior over par­ti­cle dis­tri­bu­tions and mo­menta. It would definitely not be enough to learn the po­si­tion and mo­men­tum of ev­ery par­ti­cle; a key point of stat mech is that we do not need to learn the po­si­tion and mo­men­tum of ev­ery par­ti­cle in or­der to make macro­scopic pre­dic­tions. A sim­ple max­en­tropic prior over micro­scopic states plus a (rel­a­tively) small amount of macro­scopic data is enough to make macro­scopic pre­dic­tions.

That’s clearer to me, but I’m still skep­ti­cal that that’s in fact pos­si­ble. I don’t un­der­stand how the prior can be con­sid­ered “over par­ti­cle dis­tri­bu­tions and mo­menta”, ex­cept via the the­o­ries and mod­els of statis­ti­cal me­chan­ics, i.e. as­sum­ing that those micro­scopic de­tails can be ig­nored.

The point I want to make is that the sys­tem’s “com­plex­ity” is not a fun­da­men­tal bar­rier re­quiring fun­da­men­tally differ­ent epistemic prin­ci­ples.

I agree with this. But I think you’re elid­ing how much work is in­volved in what you de­scribed as:

Mak­ing it effi­cient is where stat mech, mul­ti­s­cale mod­el­ling, etc come in.

I wouldn’t think that stan­dard statis­ti­cal me­chan­ics would be suffi­cient for mod­el­ing the Earth’s cli­mate. I’d ex­pect fluid dy­nam­ics is also im­por­tant as well as chem­istry, ge­ol­ogy, the dy­nam­ics of the Sun, etc.. It’s not ob­vi­ous to me that statis­ti­cal me­chan­ics would be effec­tive alone in prac­tice.

• Ah… I’m talk­ing about stat mech in a broader sense than I think you’re imag­in­ing. The cen­tral prob­lem of the field is the “bridge laws” defin­ing/​ex­press­ing macro­scopic be­hav­ior in terms of micro­scopic be­hav­ior. So, e.g., de­riv­ing Navier-Stokes from molec­u­lar dy­nam­ics is a stat mech prob­lem. Of course we still need the other sci­ences (chem­istry, ge­ol­ogy, etc) to define the sys­tem in the first place. The point of stat mech is to take low-level laws with lots of de­grees of free­dom, and de­rive macro­scopic laws from them. For very coarse, high-level mod­els, the “low-level model” might it­self be e.g. fluid dy­nam­ics.

I think you’re elid­ing how much work is in­volved in what you de­scribed as...

Yeah, this stuff definitely isn’t easy. As you ar­gued above, the gen­eral case of the prob­lem is ba­si­cally AGI (and also the topic of my own re­search). But there are a lot of ex­ist­ing tricks and the oc­ca­sional rea­son­ably-gen­eral-tool, es­pe­cially in the mul­ti­s­cale mod­el­ling world and in Bayesian stat mech.

• Yes, I don’t think we re­ally dis­agree. My prior (prior to this ex­tended com­ments dis­cus­sion) was that there are lots of won­der­ful ex­ist­ing tricks, but there’s no real short­cut for the fully gen­eral prob­lem and any such short­cut would be effec­tively AGI any­ways.

• cli­mate mod­els are already “low-level physics” ex­cept that “low-level” means coarse ag­gre­gates of cli­mate/​weather mea­sure­ments that are so big that they don’t in­clude trop­i­cal cy­clones!

Just as an aside, a typical modern climate model will simulate tropical cyclones as emergent phenomena from the coarse-scale fluid dynamics, albeit not enough of the most intense ones. Much smaller tropical thunderstorm-like systems, though, are represented much more crudely.

• Tan­gen­tial, but now I’m cu­ri­ous… do you know what dis­cretiza­tion meth­ods are typ­i­cally used for the fluid dy­nam­ics? I ask be­cause in­suffi­ciently-in­tense cy­clones sound like ex­actly the sort of thing APIC meth­ods were made to fix, but those are rel­a­tively re­cent and I don’t have a sense for how much adop­tion they’ve had out­side of graph­ics.

• do you know what dis­cretiza­tion meth­ods are typ­i­cally used for the fluid dy­nam­ics?

There’s a mixture: finite differencing used to be used a lot but seems to be less common now; semi-Lagrangian advection seems to have taken over from it in the models that used it; and some models work by doing most of the computations in spectral space and neglecting the smallest spatial scales. Recently, newer methods have been developed to work better on massively parallel computers. It’s not my area, though, so I can’t give a very expert answer, but I’m pretty sure the people working on it think hard about trying not to smooth out intense structures (though that has to be balanced against maintaining numerical stability).

• How much are ‘graph­i­cal’ meth­ods like APIC in­cor­po­rated el­se­where in gen­eral?

My in­tu­ition has cer­tainly been pumped to the effect that mod­els that mimic vi­sual be­hav­ior are likely to be use­ful more gen­er­ally, but maybe that’s not a widely shared in­tu­ition.

• I would have hoped that was the case, but that’s in­ter­est­ing that both large and small ones are ap­par­ently not so eas­ily emer­gent.

I won­der whether the mod­els are so coarse that the cy­clones that do emerge are in a sense the min­i­mum size. That would read­ily ex­plain the lack of smaller emer­gent cy­clones. Maybe larger ones don’t emerge be­cause the ‘next larger size’ is too big for the mod­els. I’d think ‘scal­ing’ of ed­dies in fluids might be in­for­ma­tive: What’s the small­est eddy pos­si­bly in some fluid? What other eddy sizes are ob­served (or can be mod­eled)?

• What’s the small­est eddy pos­si­bly in some fluid?

Not sure if this was in­tended to be rhetor­i­cal, but a big part of what makes tur­bu­lence difficult is that we see ed­dies at many scales, in­clud­ing very small ed­dies (at least down to the scale that Navier-Stokes holds). I re­mem­ber a strik­ing graphic about the on­set of tur­bu­lence in a pot of boiling wa­ter, in which the ed­dies re­peat­edly halve in size as cer­tain pa­ram­e­ter cut­offs are passed, and the num­ber of ed­dies even­tu­ally di­verges—that’s the on­set of tur­bu­lence.

• Sorry for be­ing un­clear – it was definitely not in­tended to be rhetor­i­cal!

Yes, tur­bu­lence was ex­actly what I was think­ing about. At some small enough scale, we prob­a­bly wouldn’t ex­pect to ‘find’ or be able to dis­t­in­guish ed­dies. So there’s prob­a­bly some min­i­mum size. But then is there any pat­tern or struc­ture to the larger sizes of ed­dies? For (an al­most cer­tainly in­cor­rect) ex­am­ple, maybe all ed­dies are always a mul­ti­ple of the min­i­mum size and the mul­ti­ple is always an in­te­ger power of two. Or maybe there is no such ‘dis­crete quan­ti­za­tion’ of eddy sizes, tho ed­dies always ‘split’ into nested halves (un­der cer­tain con­di­tions).

It cer­tainly seems the case tho that ed­dies aren’t pos­si­ble as emer­gent phe­nom­ena at a scale smaller than the dis­cretiza­tion of the ap­prox­i­ma­tion it­self.

• I won­der whether the mod­els are so coarse that the cy­clones that do emerge are in a sense the min­i­mum size.

It’s not my area, but I don’t think that’s the case. My im­pres­sion is that part of what drives very high wind speeds in the strongest hur­ri­canes is con­vec­tion on the scale of a few km in the eye­wall, so mod­els with that sort of spa­tial re­s­olu­tion can gen­er­ate re­al­is­ti­cally strong sys­tems, but that’s ~20x finer than typ­i­cal cli­mate model re­s­olu­tions at the mo­ment, so it will be a while be­fore we can simu­late those sys­tems rou­tinely (though, some ar­gue we could do it if we had a com­puter cost­ing a few billion dol­lars).

• Thanks! That’s very in­ter­est­ing to me.

It seems like it might be an ex­am­ple of rel­a­tively small struc­tures hav­ing po­ten­tially ar­bi­trar­ily large long-term effects on the state of the en­tire sys­tem.

It could be the case tho that the over­all effects of cy­clones are still statis­ti­cal at the scale of the en­tire planet’s cli­mate.

Re­gard­less, it’s a great ex­am­ple of the kind of thing for which we don’t yet have good gen­eral learn­ing al­gorithms.