# Goal completion: noise, errors, bias, prejudice, preference and complexity

A putative new idea for AI control; index here.

This is a preliminary look at how an AI might assess and deal with various types of errors and uncertainties when estimating true human preferences. I'll be using the circular rocket model to illustrate how these might be distinguished by an AI. Recall that the rocket can accelerate by −2, −1, 0, 1, and 2, and the human wishes to reach the space station (at point 0 with velocity 0) and avoid accelerations of ±2. In what follows, there will generally be some noise, so to make the whole thing more flexible, assume that the space station is a bit bigger than usual, covering five squares. So “docking” at the space station means reaching {-2,-1,0,1,2} with 0 velocity.
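For concreteness, the rocket model above can be written out as a few lines of code. This is a minimal sketch under my own assumptions: the function and variable names are invented, and I assume discrete time steps on a one-dimensional track with velocity updating before position.

```python
# Minimal sketch of the circular rocket model: discrete time, 1-D position,
# accelerations restricted to {-2, -1, 0, 1, 2}. All names are illustrative.

STATION = {-2, -1, 0, 1, 2}  # the five-square space station

def step(pos, vel, acc):
    """Advance one turn: velocity updates first, then position."""
    assert acc in (-2, -1, 0, 1, 2)
    vel = vel + acc
    pos = pos + vel
    return pos, vel

def docked(pos, vel):
    """Docking means sitting on a station square with zero velocity."""
    return pos in STATION and vel == 0
```

Whether velocity or position updates first within a turn is an arbitrary convention here; the post doesn't pin it down.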

The purpose of this exercise is to distinguish true preferences from other things that might seem to be preferences from the outside, but aren't. Ultimately, if this works, we should be able to construct an algorithm that identifies preferences, such that anything it rejects is at least arguably not a preference.
So, at least initially, I'll be identifying terms like “bias” with “fits into the technical definition of bias used in this model”; once the definitions are refined, we can then check whether they capture enough of the concepts we want.
So I'm going to use the following terms to distinguish various technical concepts in this domain:
1. Noise

2. Error

3. Bias

4. Prejudices

5. Known prejudices

6. Preferences

Here, noise is when the action selected by the human doesn't lead to the correct action output. Error is when the human selects the wrong action. Bias is when the human is following the wrong plan. A prejudice is a preference the human has that they would not endorse if it were brought to their conscious notice. A known prejudice is a prejudice the human knows about, but can't successfully correct within themselves.
And a preference is a preference.
What characteristics would allow the AI to distinguish between these? Note that part of the reason the terms are non-standard is that I'm not starting with perfectly clear concepts and attempting to distinguish them; instead, I'm finding ways of distinguishing various concepts, and seeing if these map on well to the concepts we care about.

## Noise versus preference and complexity

In this model, noise is seen as a 5% chance of under-accelerating, i.e. a desired acceleration of ±2 will, 5% of the time, give an acceleration of ±1. And a desired acceleration of ±1 will, 5% of the time, give an acceleration of zero.
The human starts from a position where, to reach the space station at zero velocity, the best plan is to accelerate for a long while and then decelerate (by −1) for two turns. Accelerating by −2 once would also do the trick, though the human prefers not to do that, obviously.
As in the previous post, the AI has certain features φi to explain the human's behaviour. They are:
* φ0({-2,-1,0,1,2},0;-)=1
* φ1(-,-;-)=1
* φ2(-,-;+2)=1
* φ3(-,-;-2)=1
The first feature indicates that the five-square space station is in a special position (if the velocity is 0). The second feature (a universal feature) is used to show that wasting time is not beneficial. The third feature is used to rule out accelerations of +2, the last feature those of −2.
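One possible encoding of these four features as indicator functions (the encoding and argument order are my own; in the post's notation, “-” means “any value”):

```python
# The four features as indicator functions on (position, velocity, acceleration).
# Encoding is illustrative, not from the original post.

STATION = {-2, -1, 0, 1, 2}

def phi0(pos, vel, acc):
    """1 on a station square with zero velocity: the docking feature."""
    return 1 if pos in STATION and vel == 0 else 0

def phi1(pos, vel, acc):
    """1 everywhere: weighted negatively, this penalises wasted turns."""
    return 1

def phi2(pos, vel, acc):
    """1 when acceleration +2 is used."""
    return 1 if acc == 2 else 0

def phi3(pos, vel, acc):
    """1 when acceleration -2 is used."""
    return 1 if acc == -2 else 0
```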
Given the trajectory it's seen, the AI can confidently fit φ0, φ1 and φ2 to some estimate of true rewards (the human rushes to the space station, without using +2 accelerations). However, it doesn't know what to do with φ3. The human had an opportunity to use a −2 acceleration, but went for two −1s instead. There are two options: the human actually wants to avoid −2 accelerations, and everything went well. Or the human doesn't want to avoid them, but the noise forced their desired −2 acceleration down to −1.
Normally there would be a complexity prior here, with a three-feature explanation being the most likely—possibly still the most likely after multiplying it by 5% to account for the noise. However, there is a recurring risk that the AI will underestimate the complexity of human desires. One way of combating this is to not use a complexity prior, at least up to some “reasonable” size of human desires. If the four-feature explanation has more prior weight than the three-feature one, then φ3 is likely to be used and the AI will see the −1,−1 sequence as deliberate, not noise.
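As a back-of-the-envelope illustration of why equal priors favour the deliberate reading: the prior weights below are invented for illustration, and only the 5% noise rate comes from the model.

```python
# Compare "deliberate -1,-1" (four relevant features) against
# "noised-down -2" (three relevant features). Priors are invented;
# the 5% noise rate is the model's.

p_noise = 0.05

prior_three = 0.5       # assumed prior: three-feature explanation
prior_four = 0.5        # assumed equal prior: four-feature explanation

lik_three = p_noise     # needs one under-acceleration event to explain -1,-1
lik_four = 1.0          # -1,-1 is exactly what the four-feature human wants

post_three = prior_three * lik_three
post_four = prior_four * lik_four

print(post_four / post_three)  # → 20.0: deliberate wins by a factor of 20
```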
A warning, however: humans have complex preferences, but those preferences are not relevant to every single situation. What about φ4, the human preference for chocolate, φ5, the human preference for dialogue in movies, and φ6, the human preference for sunlight? None of them would appear directly in this simple model of the rocket equation. And though φ0-φ1-φ2-φ3 is a four-feature model of human preferences, so is φ0-φ1-φ2-φ6 (which is indistinguishable, in this example, from the three-feature model φ0-φ1-φ2).
So we can't say “models with four features are more likely than models with three”; at best we could say “models with three *relevant* features are as likely as models with four *relevant* features”. But, given that, the AI will still converge on the correct model of human preferences.
Note that as the amount of data/trajectories increases, the ability of the AI to separate preference from noise increases rapidly.

## Error versus bias versus preference

First, set noise to zero. Now imagine that the rocket is moving at a relative velocity such that the ideal strategy to reach the space station is to accelerate by +1 for three more turns, and then decelerate by −1 for several turns until it reaches the station.
Put the noise back up to 5%. Now, the optimum strategy is to start decelerating immediately (since there is a risk of under-accelerating during the deceleration phase). Instead, the human starts accelerating by +1.
There are three possible explanations for this. Firstly, the human may not actually want to dock at the space station. Secondly, the human may be biased—overconfident, in this case. The human may believe there is no noise (or that they can overcome it through willpower and fancy flying?) and therefore is following the ideal strategy for the no-noise situation. Or the human may simply have made an error, doing +1 when they meant to do −1.
These options can be distinguished by observing subsequent behaviour (and behaviour on different trajectories). If we assume the preferences are correct, then a biased trajectory involves the human following the ideal plan for an incorrect noise value, and then desperately adjusting at the end when they realise their plan won't work. An error, on the other hand, should result in the human trying to undo their action as best they can (say, by decelerating next turn rather than following the +1,+1,−1,−1,… strategy of the no-noise world).
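The behavioural signatures just described could be turned into a crude classification rule. This is entirely my own sketch, and it looks only at the single next action; it is not a real inference procedure.

```python
# Crude heuristic: after a surprising action, look at what the human does next.
# A sketch of the error/bias distinction in the text, not an actual algorithm.

def classify_surprise(next_action, undo_action, biased_plan_action):
    """
    next_action: action observed on the following turn
    undo_action: the action that best undoes the surprise (error signature)
    biased_plan_action: next step of the ideal no-noise plan (bias signature)
    """
    if next_action == undo_action:
        return "error"            # human is repairing a mistake
    if next_action == biased_plan_action:
        return "bias"             # human is sticking to a flawed plan
    return "possible preference"  # neither signature fits
```

In practice, as the text goes on to argue, the AI would aggregate evidence over many trajectories rather than trusting a single next action.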
These are not sharp distinctions (especially on a single trajectory or a small set of them). Maybe the human has a preference for odd manoeuvres as it approaches the space station. Maybe they make a mistake every turn, and purely coincidentally follow the right trajectory. And so on.
But this is about the most likely (simplest) explanation. Does the human show all the signs of being a competent seeker of a particular goal, except that sometimes they seem to do completely random things, which they then try to repair (or shift to a completely alternate strategy if the random action is not repairable)? Most likely an error.
Is the human behaviour best explained by simple goals, but a flaw in the strategy? This could happen if the overconfident human always accelerated too fast, and then did some odd manoeuvres back and forth to dock with the station. This could be odd preferences for the docking procedure, but a larger set of trajectories could rule that out: sometimes, the overconfident human will arrive perfectly at the station. In that case, they will *not* perform the back-and-forth dance, revealing that that behaviour was a result of a flawed strategy (bias) rather than odd preferences.
A subtlety in distinguishing bias arises when the human (or maybe the system they're in) uses meta-rationality to correct for the bias. Maybe the human is still overconfident, but has picked up a variety of habits that compensate for that overconfidence. How would the AI interpret some variant of an overly prudent approach phase, followed by wildly reckless late manoeuvring (when errors are easier to compensate for)? This is not clear, and requires more thought.

## Preference versus prejudice (and bias)

This is the trickiest distinction of all—how would you distinguish a prejudice from a true preference? One way of approaching it is to see if presenting the same information in different ways makes a difference.
This can be attempted with bias as well. Suppose the human's reluctance for ±2 accelerations is due to a bias that causes them to fear that the rocket will fall apart at those accelerations, but that bias isn't accurate. Then the AI can report either “we have an acceleration of +2” or “we have the highest safe acceleration”. Both are saying the same thing, but the human will behave differently in each case, revealing something about what is preference and what is bias.
What about prejudice? Racism and sexism are obvious examples, but prejudice is more common than that. Suppose the pilot listens to opera music while flying, and unconsciously presses down harder on the accelerator while listening to “Ride of the Valkyries”. This fits perfectly into the prejudice format: it's a preference that the pilot would want to remove if they were informed about it.
To test this, the AI could offer to inform the human pilot of the music selection when the pilot was planning the flight (possibly at some small price). If the pilot had a genuine preference for “flying fast when listening to Wagner”, then this music selection is relevant to their planning, and they'd certainly want it. If the prejudice was unconscious, however, they would see no interest in seeing the music selection at this point.
Once a prejudice is identified, the AI then has the option of asking the human directly whether they agree with it (thus upgrading it to a true but previously unknown preference).

## Known prejudices

Sometimes, people have prejudices, know about them, don't like them, but can't avoid them. They might then have very complicated meta-behaviours to avoid falling prey to them. To use the Wagner example, someone trying to repress that prejudice would seem to have the double preferences “I prefer to never listen to Wagner while flying” and “if, however, I do hear Wagner, I prefer to fly faster”, when in fact neither of these is a preference.
It would seem that the simplest approach would be to have people list their undesired prejudices. But apart from the risk that they could forget some of them, their statements might be incorrect. They could say “I don't want to want to fly faster when I hear opera”, while in reality only Wagner causes that in them. So further analysis is required beyond simply collecting these statements.

## Revisiting complexity

In a previous post, I explored the idea of giving the AI some vague idea of the size and complexity of human preferences, which it should aim for in its explanations. However, I pointed out a tradeoff: if the size was too large, the AI would label prejudices or biases as preferences, while if the size was too small, it would ignore genuine preferences.
If there are ways of distinguishing biases and prejudices from genuine preferences, though, then there is no trade-off. Just put the expected complexity for *combined* human preferences+prejudices+biases at some number, and let the algorithm sort out what is preference and what isn't. It is likely much easier to estimate the size of human preferences+pseudo-preferences than it is to identify the size of true preferences (which might vary more from human to human, for a start).
I welcome comments, and will let you know if this research angle goes anywhere.
• This seems like a good time to bring up information temperature. After all, there is the deep parallel of entropy in information theory and physics. When comparing models, by what factor do you penalize a model for requiring more information to specify it? That would be analogous to the inverse temperature. I have yet to encounter a case where it makes sense in information theory, though.

Also, another explanation of the extra +1 is that the risk of having to use a −2 doesn't seem that scary—it is not a very strong preference. If the penalty for a −2 was 10 while −1, 0, or +1 was 1, then as long as the probability of needing to hit −2 to stay on the station is less than 11% and it saves a turn, going for the extra +1 seems like a good move. If the penalty is smaller (4, say), then an even greater risk seems reasonable.

• How is inverse temperature a penalty on models? If you're referring to the inverse temperature in the Maxwell-Boltzmann distribution, the temperature is considered a constant, and it gives the likelihood of a particle having a particular configuration, not the likelihood of a distribution.

Also, I'm not sure it's clear what you mean by “information to specify [a model]”. Does a high inverse temperature mean a model requires more information, because it's more sensitive to small changes and therefore derives more information from them, or does it mean that the model requires less information, because it derives less information from inputs?

The entropy of the Maxwell-Boltzmann distribution is, I think, proportional to log-temperature, so high temperature (low sensitivity to inputs) is preferred if you go strictly by that. People who train neural networks generally do this as well to prevent overtraining, and they call it regularization.

If you are referring to the entropy of a model, you penalize a distribution for requiring more information by selecting the distribution that maximizes entropy subject to whatever invariants your model must abide by. This is typically done through the method of Lagrange multipliers.

• You assign a probability to a microstate according to its energy and the temperature. The density of states at various temperatures creates very nontrivial behavior (especially in solid-state systems).

You appear to know somewhat more about fitting than I do—as I understood it, you assign a probability to a specific model according to its information content and the ‘temperature’. As for the information content: if your model is a curve fit with four parameters, all of which are held to a narrow range, it has more information than a fit with three parameters held to a similar range.

In pure information theory, the information requirement is exactly steady with the density of states. One bit per bit, no matter what. If you're just picking out maximum entropy, then you don't need to refer to a temperature.

I was thinking about a penalty-per-bit that is stronger than 1/2—a stronger preference for smaller models than breaking even. Absolute zero would be when you don't care about the evidence, and you're going with a 0-bit model.

• It's true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.

As an aside, the canonical ensemble that the Maxwell-Boltzmann distribution assumes is only applicable when a given state is exceedingly unlikely to be occupied by multiple particles. The strange behavior of condensed matter that I think you're referring to (Bose-Einstein condensates) is a consequence of this assumption being incorrect for bosons, where a stars-and-bars model is more appropriate.

It is not true that information theory requires the conservation of information. The Ising model, for example, allows for particle systems with cycles of non-unity gain. This effectively means that it allows particles to act as amplifiers (or dampeners) of information, which is a clear violation of information conservation. This is the basis of critical phenomena, which is a widely accepted area of study within statistical mechanics.

I think you misunderstand how models are fit in practice. It is not standard practice to determine the absolute information content of the input, then to relay that information to various explanators. The information content of the input is determined relative to explanators. However, there are training methods that attempt to reduce the relative information transferred to explanators, and this practice is called regularization. The penalty-per-relative-bit approach is taken by a method called “dropout”, where a random “cold” model is trained on each training sample, and the final model is a “heated” aggregate of the cold models. “Heating” here just means cutting the amount of information transferred from input to explanator by some fraction.

• > It's true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.

I know. Models are not particles. They are distributions over outcomes. They CAN be the trivial distributions over outcomes (X will happen).

I was not referring to either form of degenerate gas in any of my posts here, and I'm not sure why I would give that impression. I also did not use any conservation of information, though I can see why you would think I did, when I spoke of the information requirement. I meant simply that if you add 1 bit of information, you have added 1 bit of entropy—as opposed to in a physical system, where the Fermi shell at, say, 10 meV can have much more or less entropy than the Fermi shell at 5 meV.

• I thought you were referring to degenerate gases when you mentioned nontrivial behavior in solid-state systems, since that is the most obvious case where you get behavior that cannot be easily explained by the “obvious” model (the canonical ensemble). If you were thinking of something else, I'm curious to know what it was.

I'm having a hard time parsing your suggestion. The “dropout” method introduces entropy to “the model itself” (the conditional probabilities in the model), but it seems that's not what you're suggesting. You can also introduce entropy to the inputs, which is another common thing to do during training to make the model more robust. There's no way to introduce 1 bit of entropy per “1 bit of information” contained in the input, though, since there's no way to measure the amount of information contained in the input without already having a model of the input. I think systematically injecting noise into the input based on a given model is not functionally different from injecting noise into the model itself, at least not in the ideal case where the noise is injected evenly.

You said that “if you add 1 bit of information, you have added 1 bit of entropy”. I can't tell if you're equating the two phrases or if you're suggesting adding 1 bit of entropy for every 1 bit of information. In either case, I don't know what it means. Information and entropy are negations of one another, and the two have opposing effects on certainty-of-an-outcome. If you're equating the two, then I suspect you're referring to something specific that I'm not seeing. If you're suggesting adding entropy for a given amount of information, it may help if you explain which probabilities are impacted. To which probabilities would you suggest adding entropy, and which probabilities have information added to them?

• 1) Any non-trivial density of states, especially for semiconductors with their van Hove singularities.

2) I don't mean a model like ‘consider an FCC lattice populated by one of 10 types of atoms. Here are the transition rates...’, such that the model is made of microstates and you need to do statistics to get probabilities out. I mean a model more like ‘each cigarette smoked increases the annual risk of lung cancer by 0.001%’, so the output is simply a distribution over outcomes, naturally (these include the others as special cases).

In particular, I'm working under the toy meta-model that models are programs that output a probability distribution over bitstreams; these are their predictions. You measure reality (producing some actual bitstream) and adjust the probability of each of the models according to the probability they gave for that bitstream, using Bayes' theorem.
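That update can be sketched directly. The names, the two toy "models", and the observed bitstream below are all invented for illustration:

```python
# Bayes update over a finite set of models, each of which assigns a
# probability to the observed bitstream. Models and data are invented.

def bayes_update(priors, likelihoods):
    """Reweight models by the probability each gave the observed bitstream."""
    unnorm = {m: priors[m] * likelihoods[m] for m in priors}
    z = sum(unnorm.values())
    return {m: w / z for m, w in unnorm.items()}

observed = "1011"  # three 1s, one 0
priors = {"fair_coin": 0.5, "biased_coin": 0.5}
likelihoods = {
    "fair_coin": 0.5 ** 4,              # each bit 50/50
    "biased_coin": (0.75 ** 3) * 0.25,  # model with P(1) = 0.75
}
posterior = bayes_update(priors, likelihoods)
# The biased-coin model gains weight because it predicted the data better.
```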

3) I may have misused the term. I mean the cost in entropy to produce that precise bitstream. Starting from a random bitstream, how many measurements do you have to use to turn it into, say, 1011011100101 with xor operations? One for each bit. Doesn't matter how many bits there are—you need to measure them all.

When you consider multiple models, you weight them as a function of their information, preferring shorter ones. A.k.a. Occam's razor. Normally, you reduce the probability by 1/2 for each bit required: P_prior(model) ~ 2^(-N), and you sum only up to the number of bits of evidence you have. This last clause is a bit of a hack to keep it normalizable (see below).

I drew a comparison of this to temperature, where you have a probability penalty of e^(-E/kT) on each microstate. You can have any value here because the number of microstates per energy range (the density of states) does not increase exponentially, but usually quadratically, or sometimes less (over short energy ranges, sometimes it is more).

If you follow the analogy back, the number of bitstreams does increase exponentially as a function of length (it doubles with each bit), so the prior probability penalty for length must be at least as strong as 1/2 per bit to avoid infinitely-long programs being preferred. But you can use a stronger exponential die-off—say, 2.01^(-N)—and suddenly the distribution is already normalizable with no need for a special hack. Whatever particular value you put in there will be your e^(1/kT) equivalent in the analogy.
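The normalizability point is easy to check numerically: with 2^N bitstreams of length N, a per-bit penalty of exactly 1/2 gives every length the same total mass (so the sum diverges), while any strictly stronger penalty gives a convergent geometric series. A small sketch (function name mine):

```python
# Total prior mass over all program lengths up to max_len, where each of the
# 2**n programs of length n gets prior base**(-n). Equivalent to summing the
# geometric series with ratio r = 2/base.

def total_mass(base, max_len):
    r = 2.0 / base
    return sum(r ** n for n in range(1, max_len + 1))

# base = 2: each length contributes exactly 1, so mass grows without bound.
print(total_mass(2.0, 1000))   # → 1000.0

# base = 2.01: ratio 2/2.01 < 1, so the series converges (limit is 200 here).
print(total_mass(2.01, 10000))
```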

• 2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.

Your toy meta-model is consistent with what I was thinking when I used the word “model” in my previous comments.

3) I see what you're saying. If you add complexity to the model, you want to make sure that its improvement in ability is greater than the amount of complexity added. You want to make sure that the model isn't just “memorizing” the correct results, and that all model complexity comes with some benefit of generalizability.

I don't think temperature is the right analogy. What you want is to penalize a model that is too generally applicable. Here is a simple case:

Consider a one-hidden-layer feed-forward binary stochastic neural network whose goal is to find binary-vector representations of its binary-vector inputs. It translates its input to an internal representation of length n, then translates that internal representation into some binary-vector output that is the same length as its input. The error function is the reconstruction error, measured as the KL-divergence from input to output.

The “complexity” you want is the length of its internal representation in bits, since each element of the internal representation can retain at most one bit of information, and that bit can be arbitrarily reflected by the input. The information loss is the same as the reconstruction error in bits, since that describes the probability of the model guessing correctly on a given input stream (assuming each bit is independent). Your criterion translates to “minimize reconstruction error + internal representation size”, and this can be done by repeatedly increasing the size of the internal representation until adding one more element reduces reconstruction error by less than one bit.
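That stopping rule can be written out as a sketch under my own assumptions: `recon_error` stands in for a measured KL-divergence in bits, assumed to improve monotonically as the representation grows, and the example error curve is invented.

```python
# Grow the internal representation until one more element buys less than
# one bit of reconstruction error. recon_error(n) is the error, in bits,
# with n internal elements; assumed monotonically non-increasing.

def choose_representation_size(recon_error, max_size):
    n = 1
    while n < max_size and recon_error(n) - recon_error(n + 1) >= 1.0:
        n += 1
    return n

# Invented error curve for illustration: error = 64/n bits.
size = choose_representation_size(lambda n: 64.0 / n, 100)
print(size)  # → 8: past n = 8, an extra element saves less than a bit
```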

• > 2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.

Well, the real thing is that (again in the toy meta-model) you consider the complete ensemble of smoker-type models and let them fight it out for good scores when compared to the evidence. I guess you can consider this process to be deduction, sure.

3) (In response to the very end.) That would be at the point where 1 bit of internal representation costs a factor of 1/2 in prior probability. If it was ‘minimize (reconstruction error + 2*representation size)’, then that would be a ‘temperature’ half that, where 1 more bit of internal representation costs a factor of 1/4 in prior probability. Colder thus corresponds to wanting your models smaller at the expense of accuracy. Sort of backwards from the usual way temperature is used in simulated annealing of MCMC systems.

• I see. You're treating “energy” as the information required to specify a model. Your analogy and your earlier posts make sense now.