You’re welcome, and thanks for the appreciation!
On point 2, I do want to point out that we’ve been funded by CEA, BERI, Eric Rogstad, and myself, and I don’t think the project would have been as good or finished as soon without that support.
It might help if you pointed at the groups you think the asymmetry is between, as I suspect you and SilentCal are imagining different lines here.
I think you see the asymmetry as being between “people who want to punch others” and “people who don’t want to punch others,” since only the first group sees any possible value in punch bug (in the short term*). SilentCal, by contrast, sees the two parties as “the person who saw the bug first” and “the person who didn’t see it,” where the only asymmetries relate to people’s abilities to spot bugs (so playing punch bug with the blind would raise that sort of concern).
*There are purported long-term benefits of playing the game, which Duncan describes in his post; in particular, it seems likely to make people better at noticing cars of a particular type. You could use this to your benefit: if you wanted to get better at noticing motorcycles on the road (because you think that would make it less likely that you get into an accident with one), you could play a modified version of punch bug keyed to motorcycles.
Indeed, I note that lots of rationalist conversation norms fail to mesh with other conversational norms because rationalists are playing something like punch bug, where the equivalents of the Volkswagen are various patterns of reasoning or argument. (“Why are you being mean to me?” “I was just pointing out an error in your thinking—you should feel free to do the same to me too.” “But that only makes sense as a deal if I want the ability to be mean to you with this sort of ‘no criticize back’ rule.”)
We need to defend the need for people to physically interact with the world, and potentially have some of those interactions be unfun, without invoking patterns of behavior that really do lead to terrible things.
I notice some level of confusion here.
Suppose Alice came to me with an argument like the following (about which I will make a meta-level, not an object-level, point; I don’t endorse the entirety of what follows):
1) The life satisfaction of humans in general depends heavily on whether or not they can bring their authentic selves to the social sphere, and this means the satisfaction of LGBT individuals in particular depends on how gender identity and relationships are policed culturally.
2) One method of policing such relationships is ‘gay bashing,’ but note that the perception of counterfactual gay-bashing is perhaps more important than the statistics of actual gay-bashing because people make decisions based on what they perceive their constraints to be. (The actual world could have everyone hiding themselves, and no one getting gay-bashed, which looks safe from the statistical view but doesn’t let us know what would happen if people didn’t hide themselves.)
3) One causal factor leading to gay bashing is the culture of sports and physical fitness, both because of historical factors and practical factors. (For example, it is simply easier for a physically fit athlete to intimidate or injure a normal person.)
4) Therefore, in order to maximize life satisfaction, we need to ban all sports, and all discussion that could lead to a culture of sports.
How would you go about responding to this argument? (Not the logical details or content of your response, but the methodology of how you give it.)
Personally, I point out that I’m a gay man, and thus have a license to discuss the object-level details because I’m sympathetic to the concerns of LGBT individuals, and then proceed with the object-level discussion of the argument. But imagine the hypothetical straight me, or even worse, the hypothetical straight athlete me. It seems like there’s some chilling going on where straight Vaniver is being punished not for actually gay bashing, but for doing anything perceived by the other party as potentially providing cover for gay bashing.
Though here I should take a step back, and look at the phrase “actually gay bashing.” In doing some fact-checking for this comment I discovered that the phrase “gay bashing,” which I had originally heard in the context of physical violence leading to hospitalization or murder, covers both verbal and physical abuse, both actual and threatened. Obviously verbally bullying someone for their sexual orientation is unacceptably cruel, but I find myself wishing there were some obvious threshold (on the level of injury, perhaps, instead of mere verbal vs. physical, which doesn’t carve reality at the joints) and a short word for “intimidation of gays above this threshold,” so that I could say things like “I am opposed to people who hospitalize gays because they unacceptably damage the social fabric and freedom of expression for gays, while I support people who argue that some methods that reduce hospitalization of gays aren’t worth their other costs because they’re part of how society correctly handles difficult decisions about tradeoffs” with pre-existing categories for both, so that it was an easier sentence to write.
It’s not obvious to me that Alice shares my desire for such a threshold, or would find it convenient if that threshold existed. It seems to me that when Alice follows her own logic, she ends up being convinced that providing cover for gay bashing is bad for the same reasons that gay bashing is bad, even if it’s not as bad. The sort of callousness that I see as probably necessary to make good tradeoffs is of the same kind as the callousness that Alice hates, because it doesn’t oppose the evil she opposes.
So it seems plausible to me that straight Vaniver would not engage with the argument, because the environment around the argument doesn’t give him any place to bring his actual self and actual views. Perhaps he feels a pang of ironic sympathy when considering point (1) of the argument.
I actually would not have generated the substance of the parent comment (or been able to articulate the follow-up explanations) without the pattern-matching described in the analogy you criticized.
This is not a fully formed take yet, but something about this rubs me the wrong way. It seems to me like you’re saying “this reasoning step was correct because it resulted in me reaching conclusion X, which seems correct,” but this doesn’t seem like an adequate response to “this conclusion seems suspect, because it was generated by a reasoning step that seems suspect.” I would expect suspect reasoning steps to be self-reinforcing (because they provide their own support, indirectly, through the conclusions that they make seem convincing), which makes procedural-level injunctions (“hmm, it seems like this reasoning step is suspect for global reasons, even though it looks locally correct”) rather important.
Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested.
I think blog posts are a potentially weird measure of effort here. I also think that this is something people are interested in doing (I think it’s a component of MIRI’s strategic sketch here, as part 8), but it isn’t the sort of thing where we have anything particularly worthwhile to show for it yet.
Perhaps it makes sense to sketch an argument for why none of the standard paradigms satisfy some desideratum? This is kind of what AI Safety Gridworlds did. But it’s more the thing where, say, gradient boosted random forests have more of the ‘transparency’ property in a particular, legalistic way (it’s easier to figure out blame for any particular classification than it would be with a neural net) but not in the way that we actually care about (being able to look at the model and figure out whether it’s thinking about things in the way that we want it to be thinking about them), which might actually be easier with a neural net (because we could look at what neuron activations correspond to).
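To make the ‘legalistic’ sense of transparency concrete, here’s a minimal sketch, assuming scikit-learn’s GradientBoostingClassifier (the synthetic dataset and variable names are just illustrative): for a single input, the final score decomposes into per-tree contributions, which makes it straightforward to assign blame for a particular classification, while telling you very little about whether the model is thinking about things the way you want.

```python
# Minimal sketch of "legalistic" transparency in boosted trees (assumes
# scikit-learn; data and names are illustrative). We decompose one
# prediction's score into per-tree contributions: easy to assign blame,
# but it says little about whether the model "thinks" the way we want.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

x = X[:1]  # a single example to "explain"
# Running decision-function value after each boosting stage.
stages = [float(np.ravel(s)[0]) for s in model.staged_decision_function(x)]
# Per-tree contribution = change in the running score at that stage.
# (The first entry also folds in the model's initial constant prediction.)
contributions = [stages[0]] + [b - a for a, b in zip(stages, stages[1:])]

for i, c in enumerate(contributions):
    print(f"tree {i:2d}: {c:+.4f}")
print("final score:", float(np.ravel(model.decision_function(x))[0]))
```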
The healers may not appreciate being asked to work so much harder, just so that the DPSers can work a bit less hard, and “but this benefits the raid” may not suffice to persuade them.
I note also that healers are much less replaceable than DPS are—or at least, that was the way of things when I was playing WoW—and so the maintenance of healer morale is considerably more important for the guild than the maintenance of DPS morale, or potentially even than finishing the encounter sooner or more successfully.
Specifically, the salary is for being a teaching assistant or a research assistant, rather than being a student, but everything is structured under the assumption that graduate students will have a relevant part-time job that covers tuition and living expenses.
One reason I don’t like your graph is that I have no idea how to suffer both X and Y at the same time, for the same action.
Imagine an audience with non-overlapping preferences. Suppose you have control over the thermostat, and someone likes the temperature above 20 degrees C, and another likes it below 15 degrees C. There’s no way to get fewer than one person complaining about your choice.
LessWrong is not the place for this sort of complaint, hence the downvotes (including mine).
Note that while the Slack channel has a similar name, it is an independent entity run by Elo, and doesn’t have the same moderation team.
The honest debater can give a whole bunch of RGB pixel values, which, even if it doesn’t conclusively establish a lie, will make the truth-telling strategy have a higher winning probability, which would be enough to make both debaters converge to telling the truth during training.
One thing that I find myself optimizing for is compression (which seems like a legitimate reason for the actual AI debates to talk about subjective impressions as opposed to objective measurements). It seems to me like if the debaters both just provide the judge with the whole image using natural language, then the honest debater is sure to win (both of them provide the judge with an image, both of them tell the judge one pixel to check from the other person, the honest debater correctly identifies a fake pixel and is immune to a similar attack from the dishonest debater). But this only makes sense if talk with the debaters is cheap, and external validation is expensive, which is not the case in the real use cases, where the judge’s time evaluating arguments is expensive (or the debaters have more subject matter expertise than the judge, such that just giving the judge raw pixel values is not enough for the judge to correctly classify the image).
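To make that cheap-talk version concrete, here’s a toy sketch; the 4x4 “image,” the challenge rule, and all the names are illustrative, not taken from the debate paper. Each debater submits a claimed image, names one pixel of the opponent’s claim for verification, and the judge externally checks only those named pixels against ground truth.

```python
import numpy as np

# Toy version of the "cheap talk" protocol (illustrative, not from the paper):
# both debaters know the true image, each submits a claim, each names one
# pixel of the opponent's claim, and the judge verifies only those pixels.
rng = np.random.default_rng(0)
truth = rng.integers(0, 256, size=(4, 4))          # the real image

honest_claim = truth.copy()                          # honest debater reports truth
dishonest_claim = truth.copy()
dishonest_claim[2, 3] = (truth[2, 3] + 128) % 256    # fake one pixel's value

def challenge(opponent_claim):
    """Name the pixel where the opponent's claim differs most from the truth."""
    diff = np.abs(opponent_claim.astype(int) - truth.astype(int))
    return np.unravel_index(np.argmax(diff), diff.shape)

def judge(claim, pixel):
    """The judge externally verifies a single named pixel against ground truth."""
    return claim[pixel] == truth[pixel]

# The honest debater's challenge exposes the faked pixel; the dishonest
# debater's challenge against the honest claim finds nothing wrong.
print(judge(dishonest_claim, challenge(dishonest_claim)))  # False -> caught lying
print(judge(honest_claim, challenge(honest_claim)))        # True  -> no lie found
```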
Not sure I understand the part about uncertainty.
Most of my discussion is about the cat vs. dog debate game with humans, where it’s assumed that the two debaters both know what the ground truth is (and the judge could verify it, given the image). But perhaps it is difficult for the debaters to discern the actual ground truth, and there is an honest disagreement—that is, both debaters agree on every pixel value, but think those pixel values add up to different classifications. (The ambiguous cat-dog drawing is such an example for the image classification problem, and one can imagine different classifiers that classify such a drawing differently. Or, with regular images, different classifiers may make different random errors due to incomplete training.) Such honest disagreement is what I mean by ‘uncertainty.’ Ideally, in such a system, the debate will quickly focus the judge’s attention on the core crux and allow them to quickly settle the issue (or determine that it isn’t possible to settle with the information available).
(In the case of advanced intelligence, consider the case where the AI system is proposing a drug for human consumption, where some of its models of human preferences and physiology think that the drug would be net good, and other models think it would be net bad. It seems like a debate-style model would be good at exposing the core disagreements to human supervisors, but that it is highly unlikely that those disagreements could be resolved by the equivalent of checking a single pixel.)
My understanding is each debater can actually reveal many pixels to the judge. See this quote from section 3.2:
That sounds different to me—the point there is that, because you only need a single pixel to catch me in a lie, and any such demonstration of my dishonesty will result in your win, your limit won’t be a true limit; either you can demonstrate a single pixel where we disagree, which should be enough to establish you as the honest player, or we agree about every pixel (in which case the truth should win).
The situation that the liar would hope to establish is that they successfully tell many small lies. If I say a pixel is medium gray, and you think the pixel is light gray, I can get away with calling it medium gray because revealing a single pixel to the judge (who, absent context, can’t determine whether a pixel is ‘light’ or ‘medium’ in the relevant sense) won’t give the judge enough evidence that I’m lying to settle the debate. Especially since you might be worried that I’m going to tell an escalating series of lies, and thus if you call me out on claiming that the pixel is medium gray instead of light gray, then you won’t be able to call me out on claiming that a different pixel is black when it is in fact white. (This also means, interestingly, that the player who tries to reason about feature patches—like the diagonal pattern of a whisker—is potentially under more suspicion than a player who tries to reason about particular pixels, since it’s easier to lie about contrasts (which can’t be disproven with a single reveal) than it is to lie about pixels.)
Also consider the case of adversarial examples; if I take the reference image, determine the minimal infinity norm perturbation that results in an image of a different class, and then argue with reference to my image, presumably there’s no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it’s hard to establish a blatant lie.
Especially if we aren’t allowed to talk about RGB values, and instead have to mention subjective colors; in one game that I played, one of the players revealed a “pupil” pixel that was about #404040. This is weak evidence against being a pupil—you’d expect them to be somewhat darker—but the player hadn’t lied at all about that specific pixel’s color, just the interpretation of the color, which the other player demonstrated by revealing that an adjacent “true pupil” pixel was about #080808 (or maybe even full black).
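To illustrate the infinity-norm point above with a toy numeric sketch (the random “attack direction” is a stand-in for a real gradient-based attack such as FGSM): a perturbation with small infinity norm moves every pixel by at most epsilon, so no single reveal exposes a disagreement big enough to look like a blatant lie.

```python
import numpy as np

# Illustrative only: a small-infinity-norm perturbation (the random direction
# below stands in for the sign of a loss gradient, as in FGSM) changes every
# pixel by at most epsilon, so no single-pixel reveal shows a large lie.
rng = np.random.default_rng(1)
image = rng.random((8, 8))                                # "true" image in [0, 1]
epsilon = 0.03                                            # infinity-norm budget
attack_direction = np.sign(rng.standard_normal((8, 8)))   # stand-in for gradient sign
perturbed = np.clip(image + epsilon * attack_direction, 0.0, 1.0)

# Largest per-pixel disagreement between my claimed image and the truth:
print(np.max(np.abs(perturbed - image)))  # <= epsilon, i.e. every lie is tiny
```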
Now, perhaps ‘any lie at all’ is enough to establish who’s the honest party and who’s the liar. But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.
In Aumann, you have two Bayesian reasoners who are motivated by believing true things, who because they’re reasoning in similar ways can use the output of the other reasoner’s cognitive process to refine their own estimate, in a way that eventually converges.
Here, the reasoners are non-Bayesian, and so we can’t reach the same sort of conclusions about what they’ll eventually believe. And it seems like this idea relies somewhat heavily on game theory-like considerations, where a statement is convincing not so much because the blue player said it but because the red player didn’t contradict it (and, since they have ‘opposing’ goals, this means it’s true and relevant).
There’s a piece that’s Aumann-like in that it’s asking “how much knowledge can we extract from transferring small amounts of a limited sort of information?” Here, we’re only transferring “one pixel” per person, plus potentially large amounts of discussion about what those pixels would imply, and seeing how much that discussion can get us.
But I think ‘convergence’ is the wrong sort of way to think about it. Instead, it seems more like asking “how much of a constraint on lying is it to have it such that someone with as much information as you could expose one small fact related to the lie you’re trying to tell?”. It could be the case that this means liars basically can’t win, because their hands are tied behind their backs relative to the truth; or it could be the case that debate between adversarial agents is a fundamentally bad way to arrive at the truth, such that these adversarial approaches can’t get us the sort of trust that we need. (Or perhaps we need some subtle modifications, and then it would work.)
Arbitrary, inspired by reading left to right.
The “valley of death” in scientific contexts is research that isn’t profitable to do, sitting in between stages of research that are profitable to do (and thus is where good ideas go to die). In this particular context, it’s feasible to do preliminary studies of whether or not drugs will help with particular biomarkers of aging in old mice, and it’s feasible to do final studies of whether or not drugs are safe and effective in humans, but it’s not profitable to give those drugs to mice as they age and see how long they live (because you can’t immediately turn that into a drug, and everyone will know if it works, so it doesn’t advantage you relative to other companies, and it’s the sort of thing that normal grantmakers won’t fund academics to do).
Hence, nonprofit work to do the science that everyone wishes someone else would pay for.
I’m curating this because I think it makes a valid and subtle mathematical point, of the sort that has direct relevance to thinking about many other topics.
I asked Sarah this a while ago, and her answer (as I recall it) was that SENS is targeting more basic science stuff that’s on the left side of the ‘valley of death,’ and so is orthogonal to (and complementary with) the sort of work LRI is doing, and will also likely take longer to result in treatments in humans.
I note that it’s difficult to point out changes that one thinks are outside the current Overton Window, and thus I worry that most discussions of this topic will be tilted towards “we need fifty Stalins!”.
For example, some historical class-based systems had strongly different values of life for the different classes, in a way that seems roughly in line with their military value. (The armored and armed professional warrior can do what they like, and the manual peasant laborer has to stay out of their way.) Then mass armies and cheap, easily learned weapons dramatically changed the military calculus, and equalized the value of people, making mass democracy more sensible than it was historically. (If the peasants outvote the samurai, but the samurai could crush the peasants if it came to fighting, why would the samurai accept the votes of the peasants? If the samurai can out-privilege the peasants, but the peasants can crush the samurai if it came to fighting, why would the peasants accept the privileges of the samurai?) But will this continue to be the case, or will future developments further shift the military calculus?
But a ‘political power flows from the barrel of a gun’ story for the adoption of democracy is not a very flattering story, especially compared to the story of acknowledging the inner light that shines within each person, that democracy is uniquely positioned to listen to.
My current guess is that non-lotus pleasures are those that have some sort of negative feedback loops in them—e.g. eating isn’t a lotus, because once you’re full, it doesn’t feel as good. On the other hand, work can be a lotus, if you reach a state where you don’t want to stop, and there is nothing negative about such “lotus”. So it’s not exactly “compulsion” but a little more general, though all compulsions are also “lotuses”. Is that right?
The core component of the lotus—the reason why Duncan referred to that in the first place—is that the lotus removes you from the plot. What falls into that category depends on what you think the plot is. Someone focused primarily on their life satisfaction will have different views from someone interested in the march of scientific progress, or from someone who is focused primarily on existential risk, and so on.