# Charlie Steiner

Karma: 953
• Not sure what your meetup content is, or how you feel the real criteria for someone fitting in are. Are you going to talk about science, or technology, or philosophy, or are you going to do some kind of exercise or group activity, or are you just going to hang out?

For meetups I’ve run in the past, I think the most important criterion of fit is that someone enjoys training their cognitive skills (which was usually the meat of the meetups); enjoyment of LW subculture (“did you see X?” being a good way to have a fun conversation / hang out) was an important secondary quality.

• I strongly agree, but I think the format of the thing we get, and how to apply it, are still going to require more thought.

Human values, as they exist inside humans, natively take the form of several different, perhaps conflicting, ways of judging the human’s internal representations of the world. So first you have to make a model of a human, and figure out how you’re going to locate intentional-stance elements like “representation of the world.” Then you run into ontological crises from moving the human’s models and judgments into some common, more accurate model (that an AI might use). Get the wrong answer in one of these ontological crises, and the modeled utility function may assign high value to something we would regard as deceptive, or as wireheading the human (such reactions might give some hints towards how we want to resolve such ontological crises).

Once we’re comparing human judgments on a level playing field, we can still run into problems of conflict, problems of circularity, and other weird meta-level conflicts where we don’t value some of our values, which I’m not sure how to address in a principled way. But suppose we compress these judgments into one utility function within the larger model. Are we then done? I’m not sure.

• I’m not sure that the agent that constantly twitches is going to be motivated by coherence theorems anyways. Is the class of agents that care about coherence identical to the class of potentially dangerous goal-directed/explicit-utility-maximizing/insert-euphemism-here agents?

• When thinking about agents, the first motivation might not quite work out. Small changes in observation might introduce discontinuous changes in policy—e.g. in the Matching Pennies game. Suppose there are agents (functions) in the space that output a fixed probability p of playing Heads, no matter their input. If you can continuously vary p by moving within that space, then Matching Pennies play will be discontinuous at p = 1/2. So right away you’ve committed to some unusual behavior for the agents in the space by asking for continuity—they can’t play perfect Matching Pennies at the very least.
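To spell out that discontinuity with a toy sketch of my own (not from the original comment), think of each fixed agent as just its probability p of playing Heads, and look at the opponent’s best response as p varies:

```python
# Toy illustration (my own, not from the original comment): in Matching Pennies,
# the matcher's best response to an agent that plays Heads with fixed probability p
# jumps discontinuously as p crosses 1/2.

def matcher_best_response(p: float) -> float:
    """Optimal P(Heads) for the matcher against a fixed opponent with P(Heads) = p."""
    if p > 0.5:
        return 1.0   # always play Heads to match
    if p < 0.5:
        return 0.0   # always play Tails to match
    return 0.5       # exactly indifferent at p = 1/2; any mixture does equally well

# An arbitrarily small change in the opponent flips the whole policy:
print(matcher_best_response(0.499), matcher_best_response(0.501))  # 0.0 1.0
```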

• Because the noise usually grows as the signal does. Consider Moore’s law for transistors per chip. Back when that number was about 10^4, the standard deviation was also small—say 10^3. Now that density is 10^8, no two chips are going to be within a thousand transistors of each other, and the standard deviation is much bigger (~10^7).

This means that if you’re trying to fit the curve, being off by 10^5 is a small mistake when predicting the current transistor count, but a huge mistake when predicting past transistor counts. It’s not rare or implausible now to find a chip with 10^5 more transistors, but back in the ’70s that difference would have been a huge error, impossible under an accurate model of reality.

A basic fitting method, like least squares, doesn’t take this into account. It will trade off transistors now vs. transistors in the past as if the mistakes were of exactly equal importance. To do better you have to use something like a chi-squared method, where you explicitly weight the points differently based on their variance. Or fit on a log scale using the simple method, which effectively assumes that the noise is proportional to the signal.
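To make the contrast concrete, here’s a minimal sketch with made-up numbers (my illustration, not from the original comment) of the naive raw-scale fit versus the log-scale fit:

```python
# Minimal sketch: plain least squares on raw counts lets the recent, huge residuals
# dominate, while fitting a line to log(counts) treats proportional errors equally,
# i.e. assumes the noise is proportional to the signal. Numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
years = np.arange(1971, 2021, 5, dtype=float)
true_counts = 2250 * 2 ** ((years - 1971) / 2)               # idealized Moore's law
noisy_counts = true_counts * rng.lognormal(0.0, 0.2, size=years.size)

def exponential(t, a, doubling_rate):
    return a * 2 ** (doubling_rate * (t - 1971))

# Naive: unweighted least squares on the raw counts (the early decades barely matter).
(a_raw, rate_raw), _ = curve_fit(exponential, years, noisy_counts, p0=(2000.0, 0.4))

# Better: straight-line fit to log2(counts), weighting all eras equally.
rate_log, intercept = np.polyfit(years - 1971, np.log2(noisy_counts), deg=1)

print(f"raw-scale doubling time: {1 / rate_raw:.2f} yr, "
      f"log-scale doubling time: {1 / rate_log:.2f} yr")
```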

• When trying to fit an exponential curve, don’t weight all the points equally. Or if you’re using Excel and just want the easy way, take the log of your values and then fit a straight line to the logs.

• Ah, it started so well. And then the numbered list started, and you didn’t use any of the things from before the list at all! You assumed some new things (1, 2 and 3) that contained your entire conclusion.

Let me try to redirect you just a little.

Suppose we flip a coin and hide it under a cup without looking at it. We should bet as if the coin has P(Heads)=0.5, because when we are ignorant we can’t do better than assigning a probability, even though the reality is fixed. In fact, the same argument applies before flipping the coin if we ignore quantum effects—the universe is already arranged such that the coin will land heads or tails, but because we don’t know which, we assign a probability.

Now suppose that you get to look at the coin, while I don’t. Now you should assign P(Heads)=1 if it is heads, and P(Heads)=0 if it is tails, but I should still assign P(Heads)=0.5. Different people can assign different probabilities, and that’s okay.

The Sleeping Beauty problem has two perspectives—Sleeping Beauty’s view, and the experimenter’s view (or god’s view). In these two views, you face different constraints. To Sleeping Beauty, she is special and she knows that certain logical relationships hold between the allowed day and the state of the coin. To the experimenter, the coin and the day are independent variables, and no instance of Sleeping Beauty is special.

(note: if you think the day being Monday is an “invalid” observable, just suppose that there is a calendar outside the room and Sleeping Beauty is predicting what she will see when she checks the calendar, much like how we predicted what we would see when we looked at the flipped coin.)

Everyone thinks that assigning probabilities from the experimenter’s view is easy, but they disagree about Sleeping Beauty’s view.

Here’s a trick that tells you about what betting odds Sleeping Beauty should assign, using only the easy experimenter’s view! Just suppose that the experimenter is betting money against Sleeping Beauty—every time Sleeping Beauty wakes up she makes this bet. Every dollar won by Sleeping Beauty is lost by the experimenter. What is a fair price for Sleeping Beauty to pay, in exchange for the experimenter paying her $1.00 if the day is Monday? We don’t need to use Sleeping Beauty’s view to answer this question. We just use the fact that the experimenter’s view is easy, and the bet is fair if the experimenter doesn’t gain or lose any money on average, from the experimenter’s view. With probability 0.5 (for the experimenter) Sleeping Beauty only wakes up on Monday, and with probability 0.5 (for the experimenter) she wakes up on both Monday and Tuesday and makes the bet both times. So with probability 0.5 the experimenter pays a dollar and gets the fair price, and with probability 0.5 the experimenter pays a dollar and gets twice the fair price. In other words, 3 times the fair price = 2 dollars. The fair price for a bet that pays Sleeping Beauty on Monday is $2/3.
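In case it helps, here is the same expected-value check done numerically (a quick Monte Carlo sketch of my own, not part of the original argument):

```python
# A quick check of the betting argument above, from the experimenter's view only.
import random

def experimenter_profit(price: float, trials: int = 100_000) -> float:
    """Average per-trial profit for the experimenter when Beauty pays `price` each waking."""
    total = 0.0
    for _ in range(trials):
        heads = random.random() < 0.5
        wakings = 1 if heads else 2        # Monday only vs. Monday and Tuesday
        for day in range(wakings):
            total += price                  # Beauty pays the price at every waking
            if day == 0:                    # the experimenter pays $1 only if it is Monday
                total -= 1.0
    return total / trials

# The experimenter roughly breaks even at price = 2/3, as the argument concludes.
print(experimenter_profit(2 / 3))
```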

• Looks pretty interesting. I’m not super sold on this being a “nice” business model, since playing constructed at a competitive level still seems like a multi-hundred-dollar buy-in that’s only going to increase with further expansions. But I like drafting anyhow, so sure.

I’m also a little concerned about some of the big power differences in heroes, and certain instances of early-game RNG—a little is necessary, but I think things can get unfun when there’s a high enough variance and clear enough options that you can tell that you’ve probably already won or lost, but have to play it out anyhow.

Still, I’ll probably get it—I’m more or less done with Slay the Spire (if you like card-based combat, puzzly roguelikes, good balance, and high difficulty, I definitely recommend that game, but at this point I’ve beaten A20 with all the dudes, and don’t feel like going for high winrate), and the gameplay videos seem interesting.

Anyone can PM me if they want to talk Artifact, I guess?

• Have you read the Blue-Minimizing Robot? Early Homo sapiens were in a simple environment where it seemed like they were “minimizing blue,” i.e. maximizing genetic fitness. Now, you might say, it seems like our behavior indicates preferences for happiness, meaning, validation, etc., but really that’s just an epiphenomenon no more meaningful than our previous apparent preference for genetic fitness.

However, there is an important difference between us and the blue-minimizing robot, which is that we have a much better model of the world, and within that model of the world we do a much better job than the robot at making plans. What kind of plans? The thing that motivates our plans is, from a purely functional perspective, our preferences. And this thing isn’t all that different in modern humans versus hunter-gatherers. We know, we’ve talked to them. There have been some alterations due to biology and culture, but not as much as there could have been. Hunter-gatherers still like happiness, meaning, validation, etc.

What seems to have happened is that evolution stumbled upon a set of instincts that produced human planning, and that in the ancestral environment this correlated well with genetic fitness, but in the modern environment this diverges even though the planning process itself hasn’t changed all that much. There are certain futuristic scenarios that could seriously disrupt the picture of human values I’ve given, but I don’t think it’s the default, particularly if there aren’t any optimization processes much stronger than humans running around.

• Hm. I wonder what an “alternative” to neural nets and gradient descent would look like. Neural nets are really just there as a highly expressive model class that gradient descent works on.

One big difficulty is that if your model is going to classify pictures of cats (or go boards, etc.), it’s going to be pretty darn complicated, and I’m sceptical that any choice of model class is going to prevent that. But maybe one could try to “hide” this complexity in a recursive structure. Neural nets already do this, but convnets especially mix up spatial hierarchy with logical hierarchy, and NNs in general aren’t as nicely packaged into human-thought-sized pieces as maybe they could be—consider resnets, which work well precisely because they abandon the pretense of each neuron being some specific human-scale logical unit.

So maybe you could go the opposite direction and make that pretense a reality with some kind of model class that tries to enforce “human-thought-sized” reused units with relatively sparse inter-unit connections? Could still train with SGD, or treat hypotheses as decision trees and take advantage of that literature.
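For what it’s worth, here is one extremely rough sketch of what such a model class might look like (entirely my own speculation, with hypothetical names, not a worked-out proposal): a few small reusable units connected by a fixed sparse wiring, still trainable end-to-end with SGD.

```python
# Rough, speculative sketch (hypothetical names): a model built from small
# reusable "units" with a fixed, sparse wiring between them, rather than one
# dense blob. Still trainable end-to-end with SGD.
import torch
import torch.nn as nn

class SmallUnit(nn.Module):
    """A deliberately tiny block, meant to stay 'human-thought-sized'."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class SparseUnitNetwork(nn.Module):
    def __init__(self, input_dim: int, unit_dim: int, wiring: dict[int, list[int]]):
        super().__init__()
        # wiring[i] lists which earlier units (or -1 for the raw input) feed unit i.
        self.wiring = wiring
        self.units = nn.ModuleList(
            SmallUnit(sum(input_dim if j == -1 else unit_dim for j in parents), unit_dim)
            for parents in wiring.values()
        )

    def forward(self, x):
        outputs = []
        for i, parents in self.wiring.items():
            inp = torch.cat([x if j == -1 else outputs[j] for j in parents], dim=-1)
            outputs.append(self.units[i](inp))
        return outputs[-1]

# Example wiring: units 0 and 1 each see the input; unit 2 only sees units 0 and 1.
model = SparseUnitNetwork(input_dim=16, unit_dim=8, wiring={0: [-1], 1: [-1], 2: [0, 1]})
print(model(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```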

But suppose we got such a model class working, and trained it to recognize cats. Would it actually be human-comprehensible? Probably not! I guess I’m just not really clear on what “designed for transparency and alignability” is supposed to cash out to at this stage of the game.

• Interesting! I’m still concerned that, since you need to aggregate these things in the end anyhow (because everything is commensurable in the metric of affecting decisions), the aggregation function is going to be allowed to be very complicated and dependent on factors that don’t respect the separation of this trichotomy.

But it does make me consider how one might try to import this into value learning. I don’t think it would work to take these categories as given and then try to learn meta-preferences to sew them together, but most (particularly more direct) value learning schemes have to start with some “seed” of examples. If we draw that seed only from “approving,” does that mean that the trained AI isn’t going to value wanting or liking enough? Or would everything probably be fine, because we wouldn’t approve of bad stuff?

• #8 actually comes up in physics: in the field of nonlinear dynamics (pretty picture, actual wikipedia), the fact that continuous changes in functions can lead to surprising changes in fixed points (specifically stable attractors) is pretty darn important to understanding e.g. phase transitions!
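(A standard worked example, my addition, using the usual logistic map: for f_r(x) = r x (1 - x), the nonzero fixed point is x* = 1 - 1/r, and f_r'(x*) = 2 - r, so that fixed point is a stable attractor only for 1 < r < 3. As r crosses 3 the attractor changes qualitatively, via period doubling, even though f_r varies perfectly continuously with r.)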

• Does this work for #7? (and question) (Spoilers for #6):

I did #6 using 2D Sperner’s lemma and closedness. Imagine the destination points are colored [as in #5, which was a nice hint] by where they are relative to their source points—split the possible difference vectors into a colored circle as in #5 [pick the center to be a fourth color so you can notice if you ever sample a fixed point directly, but if fixed points are rare this shouldn’t matter], and take samples to make it look like 2D Sperner’s lemma, in which there must be at least one interior tri-colored patch. Define a limit of zooming in that moves you towards the tri-colored patch, apply closedness to say the center (fixed) point is included, much like how we were encouraged to do #2 with 1D Sperner’s lemma.
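(For reference, the standard 2D Sperner’s lemma being invoked, added here as a reminder: if a big triangle with corners v1, v2, v3 is triangulated, corner v_i gets color i, vertices on the edge between v_i and v_j get color i or j, and interior vertices get any of the three colors, then some small triangle of the triangulation has vertices of all three colors, and in fact an odd number of such triangles do.)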

To do #7, it seems like you just need to show that there’s a continuous bijection that preserves whether a point is interior or on the edge, from any convex compact subset of R^2 to any other. And there is indeed a recipe to do this—it’s like you imagine sweeping a line across the two shapes, at rates such that they finish in equal time. Apply a 1D transformation (affine will do) at each point in time to make the two cross sections match up and there you are. This uses the property of convexity, even though it seems like you should be able to strengthen this theorem to work for simply connected compact subsets (if not—why not?).

EDIT: (It turns out that I think you can construct pathological shapes with uncountable numbers of edges for which a simple linear sweep fails no matter the angle, because you’re not allowed to sweep over an edge of one shape while sweeping over a vertex of the other. But if we allow the angle to vary slightly with parametric ‘time’, I don’t think there’s any possible counterexample, because you can always find a way to start and end at a vertex.)

Then once you’ve mapped your subset to a triangle, you use #6. But.

This doesn’t use the hint! And the hints have been so good and educational everywhere I’ve used them. So what am I missing about the hint?

• Yeah, I did the same thing :)

Putting it right after #2 was highly suggestive—I wonder if this means there’s some very different route I would have thought of instead, absent the framing.

• Shrug I dunno man, that seems hard :) I just tend to evaluate community norms by how well they’ve worked elsewhere, and gut feeling. But neither of these is any sort of diamond-hard proof.

Your question at the end is pretty general, and I would say that most chakra-theorists would not want to join this community, so in a sense we’re already mostly avoiding chakra-theorists—and there are other groups who are completely unrepresented. But I think the mechanism is relatively indirect, and that’s good.

• Consider something like protecting the free speech of people you strongly disagree with. It can be an empirical fact (according to one’s model of reality) that if just those people were censored, the discussion would in fact improve. But such pointlike censorship is usually not an option that you actually have available to you—you are going to have unavoidable impacts on community norms and other people’s behavior. And so most people around here protect something like a principle of freedom of speech.

If costs are unavoidable, then, isn’t that just the normal state of things? You’re thinking of “harm” as relative to some counterfactual state of non-harm—but there are many counterfactual states an online discussion group could be in that would be very good, and I don’t worry too much about how we’re being “harmed” by not being in those states, except when I think I see a way to get there from here.

In short, I don’t think I associate the same kind of negative emotion with these kinds of tradeoffs that you do. They’re just a fairly ordinary part of following a strategy that gets good results.

• I like to make the distinction between thinking the chakra-theorists are valuable members of the community, and thinking that it’s important to have community norms that include the chakra-theorists.

It’s a lot like the distinction between morality and law. The chakra theorists are probably wrong and in fact it probably harms the community that they’re here. But it’s not a good way to run a community to kick them out, so we shouldn’t, and in fact we should be as welcoming to them as we think we should be to similar groups that might have similar prima facie silliness.

• So, to sum up (?):

We want the AI to take the “right” action. In the IRL framework, we think of getting there by a series of ~4 steps: (observations of human behavior) → (inferred human decisions in a model) → (inferred human values) → (right action).

Going from step 1 to 2 is hard, and ditto with 2 to 3, and we’ll probably learn new reasons why 3 to 4 is hard when we try to do it more realistically. You mostly use model mis-specification to illustrate this—because very different models at step 2 can predict similar step-1 behavior, the inference is hard in a certain way. Because very different models at step 3 can predict similar step-2 behavior, that inference is also hard.
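As a purely schematic illustration of that chain (my own sketch, with hypothetical function names, just to fix the structure in code):

```python
# Schematic only: hypothetical names illustrating the ~4-step IRL chain above.
# Each arrow in the summary becomes one (hard) inference step.
from typing import Any

Observations = Any   # raw human behavior
DecisionModel = Any  # inferred model of how the human decides
Values = Any         # inferred human values / reward
Action = Any

def infer_decision_model(obs: Observations) -> DecisionModel:
    """Step 1 -> 2: hard, because very different decision models predict similar behavior."""
    ...

def infer_values(model: DecisionModel) -> Values:
    """Step 2 -> 3: hard for the same reason, one level up."""
    ...

def choose_action(values: Values) -> Action:
    """Step 3 -> 4: presumably hard in new ways once done realistically."""
    ...

def irl_pipeline(obs: Observations) -> Action:
    return choose_action(infer_values(infer_decision_model(obs)))
```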