Suppose the goal dramatically overvalues some option; then the AI would be willing to pay large (correctly estimated) costs in order to achieve “even larger” (incorrectly estimated) gains.
When I think about solutions to AI alignment, I often think about ‘meaningful reductionism.’ That is, if I can factor a problem into two parts, and the parts don’t actually rely on each other, then I have two smaller problems to solve. But if the parts are reliant on each other, I haven’t really simplified anything yet.
While impact measures feel promising to me as a cognitive strategy (often my internal representation of politeness feels like ‘minimizing negative impact’, like walking on sidewalks in a way that doesn’t startle birds), they don’t feel promising to me as reductionism. That is, if I already had a solution to the alignment problem, then impact measures would likely be part of how I implement that solution, but solving it separately from alignment doesn’t feel like it gets me any closer to solving alignment.
[The argument I like most here rests on the difference between costs and side effects; we don’t want to minimize side effects, because that leads to minimizing good side effects too, and it’s hard to specify the difference between ‘side effects’ and ‘causally downstream effects,’ and so on. But if we just tell the AI “score highly on a goal measure while scoring low on this cost measure,” this only works if we specified both the goal and the cost correctly.]
But there’s a different approach to AI alignment, which is something more like ‘correct formalisms.’ We talk sometimes about handing a utility function to the robot, or (in old science fiction) providing it with rules to follow, or so on, and by seeing what it actually looks like when we follow that formalism we can figure out how well it fits what we’re interested in. Utility functions on sensory inputs don’t seem alignable because of various defects (like wireheading), and so it seems like the right formalism needs some other features (it might still be a utility function, but it needs to be a utility function over mental representations of external reality, in such a way that the mental representation tracks external reality even when you have freedom to alter your mental representation; we can’t turn that into code yet).
So when I ask myself questions like “why am I optimistic about researching impact measures now?” I get answers like “because exploring the possibility space will make clear exactly how the issues link up.” For example, looking at things like relative reachability made it clear to me how value-laden the ontology needs to be in order for a statistical measure on states to be meaningful. This provides a different form-factor for ‘transferring values to the AI’; instead of trying to ask something like “is scenario A or B better?” and train a utility function, I might instead try to ask something like “how different are scenarios A and B?” or “how are scenarios A and B different?” and train an ontology, with the hopes that this makes other alignment problems easier because the types line up somewhat more closely.
[I think even that last example still performs poorly on the ‘meaningful reductionism’ angle, since getting more options for types to use in value loading doesn’t seem like it addresses the core obstacles of value loading; but it does provide some evidence of how this exploration could be useful, and helps clarify thinking.]
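(To gesture at the relative reachability point above, here’s a minimal sketch; it’s not the published formalism, and `reach` is a hypothetical function I’m supplying for illustration, but it shows where the ontology sneaks in:)

```python
# A minimal sketch of a reachability-style impact penalty (not the published
# relative reachability formalism). `reach(a, b)` is a hypothetical function
# in [0, 1] measuring how easily state b can be reached from state a.

def reachability_penalty(s_actual, s_baseline, states, reach):
    """Average, over states, of reachability lost relative to the baseline."""
    losses = [max(0.0, reach(s_baseline, s) - reach(s_actual, s)) for s in states]
    return sum(losses) / len(states)
```

The penalty itself is just an average; all the value-laden work is done by how you enumerate `states` and by what `reach` treats as similar, which is the sense in which the ontology has to do the heavy lifting.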
Could GPT-2 make a good weird sun twitter? Probably not, but it could at least be a good InspiroBot.
It’s trained on the whole corpus of LW comments and replies that got sufficiently high karma; naively I wouldn’t expect a day to make much of a dent in the training data. But there’s an interesting fact about training to match distributions, which is that most measures of distributional overlap (like the KL divergence) are asymmetric: how similar the corpus is to model outputs is different from how similar model outputs are to the corpus. Geoffrey Irving is interested in methods to use supervised learning to do distributional matching in the other direction, and it might be the case that comment karma is a good way to do it. My guess is that you’re better off comparing outputs generated from the same prompt head-to-head, picking which one is more ‘normal,’ and training a discriminator to mimic the human normality judgment.
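(A toy illustration of that asymmetry, with made-up categorical distributions standing in for ‘corpus’ and ‘model outputs’; the numbers are assumptions, not measured from anything:)

```python
# Toy illustration of KL divergence asymmetry.
import numpy as np

def kl(p, q):
    """KL(p || q): expected log-ratio under p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

corpus = np.array([0.5, 0.4, 0.1])  # hypothetical corpus frequencies
model = np.array([0.4, 0.4, 0.2])   # hypothetical model output frequencies

# Forward KL (what maximum-likelihood training minimizes): penalizes the
# model for putting too little mass where the corpus puts a lot.
print(kl(corpus, model))  # ~0.042
# Reverse KL (the "other direction"): penalizes the model for putting mass
# where the corpus puts very little, i.e. for generating off-corpus text.
print(kl(model, corpus))  # ~0.049
```

Roughly speaking, ordinary training pushes down the first number, and the head-to-head ‘normality’ judgment is one way to get pressure on something like the second.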
In the case of LessWrong, I think the core sequences are around 10,000 words, not sure how big the overall EA corpus is.
This feels like a 100x underestimate; The Sequences clock in at over a million words, I believe, and it’s not the case that only 1% of the words are core.
It feels like the per-experience costs are more relevant than the lifetime costs, since you *also* have to aggregate the lifetime annoyance. “Is it worth wearing a helmet this time to avoid 2/3rds of a micromort?”
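(As a toy calculation, with an assumed value of a statistical life of $10M, a number I’m supplying rather than one from the thread: a micromort comes out to about $10, so 2/3rds of a micromort is roughly $6.70, and the per-ride question is whether putting the helmet on this time is more annoying than that.)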
It could be the case that the “get used to it” costs are a single investment, or there are other solutions that might not be worth it for someone who can tolerate a normal helmet but are worth it for habryka.
My closest answer would be something like “in my version of utopia,” although maybe that’s too strong?
I think this implies way too much endorsement. I often find myself editing a document and thinking “in American English, the comma goes inside the quotation marks,” even though “in programming, the period goes outside the quotation marks”.
When someone has an incomplete moral worldview (or one based on easily disprovable assertions), there’s a way in which the truth isn’t “safe” if safety is measured by something like ‘reversibility’ or ‘ability to continue being the way they were.’ It is also often the case that one can’t make a single small change, and then move on; if, say, you manage to convince a Christian that God isn’t real (or some other thing that will predictably cause the whole edifice of their worldview to come crashing down eventually), then the default thing to happen is for them to be lost and alone.
Where to go from there is genuinely unclear to me. Like, one can imagine caring mostly about helping other people grow, in which case a ‘reversibility’ criterion is sort of ludicrous; it’s not like people can undo puberty, or so on. If you present them with an alternative system, they don’t need to end up lost and alone, because you can directly introduce them to humanism, or whatever. But here you’re in something of a double bind; it’s somewhat irresponsible to break people’s functioning systems without giving them a replacement, and it’s somewhat creepy to break people’s functioning systems in order to pitch your replacement. (And since ‘functioning’ is value-laden, it’s easy for you to think their system needs replacing.)
I think I have this skill, but I don’t know that I could write this guide. Partly this is because there are lots of features about me that make this easier, which are hard (or too expensive) to copy. For example, Michael once suggested part of my emotional relationship to lots of this came from being gay, and thus not having to participate in a particular variety of competition and signalling that was constraining others; that seemed like it wasn’t the primary factor, but was probably a significant one.
Another thing that’s quite difficult here is that many of the claims are about values, or things upstream of values; how can Draco Malfoy learn the truth about blood purism in a ‘safe’ way?
He appears to have had novel ideas in his technical specialty, but his public writings are mostly about old ideas that have insufficient public defense. There, novelty isn’t a virtue (while correctness is).
My sense is that his worldview was ‘very sane’ in the cynical HPMOR!Quirrell sense (and he was one of the major inspirations for Quirrell, so that’s not surprising), and that he was extremely open about it in person in a way that was surprising and exciting.
I think his standout feature was breadth more than depth. I am not sure I could distinguish which of his ideas were ‘original’ and which weren’t. He rarely if ever wrote things, which makes the genealogy of ideas hard to track. (Especially if many people who do write things were discussing ideas with him and getting feedback on them.)
If instead you had to pay for every view (as in environments where bandwidth is expensive, such as interviewing candidates for a job), then you would do the opposite of clickbait, attempting to get people to not ‘click on your content.’ (Or people who didn’t attempt to get their audience to self-screen would lose out, because of the costs, to those who did.)
The techniques you mention may include focusing on causes and consequences, but they are very solution-oriented.
Focusing, which is an introspective technique, is explicitly not focused on solutions; it’s focused on figuring out what the actual problem is (which generally is more about listening to the complaint than it is about thinking about the environment or how things could be solved). This then helps someone find a solution, but the solution-finding isn’t itself part of Focusing.
Sorry for the lack of links on this; it’s stuff that’s covered in the Sequences, but where exactly I’m not sure. I’ll gladly strong upvote a reply to this comment linking to relevant material on these points.
I don’t think it’s in single posts? Like, there’s the Robin Hanson post The Fallacy Fallacy, or JGWeissman’s Catchy Fallacy Name Fallacy, but those are mostly about “here are specific issues with focusing on fallacies” as opposed to “and also here’s how Bayesian epistemology works instead.” If I were to point to a single post, it might be Science Isn’t Strict Enough, which of course is about science instead of about logic, and is doing the “this is how this standard diverges from what seems to be the right standard” argument but in the opposite direction, sort of.
Like Raemon, I want to echo the point that following your intellectual curiosity is probably the best way to do research work, and generally make the most of your energy/time budget. But some specific considerations:
1. What seems important to Vaniver.
I expect that voting systems mostly won’t matter for AI outcomes. It seems like the primary question is whether or not the AI system we make does anything like what we like/endorse (i.e. whether or not existential accidents happen), and the secondary question is whether or not teams coordinated to form a coalition to build such a safe system (or otherwise prevented the creation of unsafe systems). Voting seems mostly useful for aggregating preferences over scarce joint decisions in a bandwidth-sensitive way (“where should the group go to lunch?” as opposed to “what do you personally want to eat?”, or “which of these four candidates should be president?” as opposed to “what are your complete views on politics?”). The coalition-building problem will likely look more like negotiation (see this paper by Critch as an example of the sort of thing that seems useful to me in that space), and the preference-satisfaction solution in the glorious transhuman future will likely look more like telling Alexa how you want your personal environment to be, without having to worry much about scarcity or joint decision-making.
It’s possible that government policy will be important, and the health of public discourse will be important, but it seems quite unlikely to me that election reforms will have the desired effects in time.
2. Whether it’s the core problem of discourse, or whether fixing it will be sufficient to overcome modern challenges.
It seems like the forces pushing towards political polarization are considerably stronger than just the pressures from electoral systems, and mostly have to do with communication media stuff. Basically, current media technologies push the creation and curation of media closer to the consumer, who has different (and worse) incentives than elites, which leads to a general dumbing-down and coarsening of discourse. Superior election technology seems likely to help broadly-liked centrists defeat people who manage to eke out 51% support and 49% hate, but that doesn’t seem like it’ll fix discussions of cultural hot spots. (Will broadly liked centrists cause American politics to be more sensible on climate change, or the weird mix of negotiations about border security, or so on?)
Figuring out what’s upstream of worsening discourse and pushing on that (or seeking to create more good discourse, or so on) is probably more effective if better public conversations are actually the goal; and even if this effort helps, if it can’t help enough, it may be better to write off the thing that it would help.
3. Whether or not it’s important just because it seems important to Vaniver.
There’s a claim in Inadequate Equilibria, specifically the end of Moloch’s Toolbox, which is that there are lots of problems that don’t get solved because there aren’t all that many people who are unbiased and will float to the problem that seems most important (the ‘maximizing altruists’) compared to the number of problems, and so you get problems that seem ‘quite serious’ but are also neglected because they’re more costly than human civilization can support at present. (This dynamic is common; when I worked in industry, there were many improvements that could be made to the system that weren’t being made because they weren’t the most important improvement to be making at the time.)
But also this sort of meta-work has its own costs. Compare Alice, who views LessWrong on her phone and notices a bug, and then fixes the bug and submits a pull request, and then moves on, with Beatrice, who considers all the bugs on LessWrong and decides which is most important, and then fixes that one and submits a pull request. Then compare both of them with Carol, who also considers all the different projects and tries to figure out which of them is most important, which also maybe requires considering all the different metrics of project importance, which also maybe requires considering all the different decision theories, which also maybe requires...
It seems good for Alice to not pay the costs of optimizing, and just do the local improvements, especially if the alternative is that Alice doesn’t make any improvements. Beatrice will do more important work, but is ‘paying twice’ for it, and in situations where the bugs are roughly equally important this means Beatrice is perhaps less effective than someone less reflective. I think that people who are naturally interested in this sort of maximizing altruism should do it, and people who aren’t (and want to just be Alice instead) should be Alice without worrying about it too much (or trying to convince themselves that, no, they are doing the maximizing altruism thing).
You might be thinking of “And the loser is… Plurality Voting” which describes a 2010 voting systems conference, where Approval Voting ended up winning the approval vote. (I do wish they had had the experts vote under a bunch of different systems, but oh well.)
I think I’m largely (albeit tentatively) with Dagon here: it’s not clear that we don’t _want_ our responses to his wrongness to back-propagate into his idea generation. Isn’t that part of how a person’s idea generation gets better?
It is important that Bob was surprisingly right about something in the past; this means something was going on in his epistemology that wasn’t going on in the group epistemology, and the group’s attempt to update Bob may fail because it misses that important structure. Epistemic tenure is, in some sense, the group saying to Bob “we don’t really get what’s going on with you, and we like it, so keep it up, and we’ll be tolerant of wackiness that is the inevitable byproduct of keeping it up.”
That is, a typical person should care a lot about not believing bad things, and the typical ‘intellectual venture capitalist’ who backs a lot of crackpot horses should likely end up losing their claim on the group’s attention. But when the intellectual venture capitalist is right, it’s important to keep their strategy around, even if you think it’s luck or that you’ve incorporated all of the technique that went into their first prediction, because maybe you haven’t, and their value comes from their continued ability to be a maverick without losing all of their claim on group attention.
Link fixed, thanks!
It’s a straightforward application of the Copernican principle. Of course, that is not always the best approach.
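(For concreteness, the Gott-style version of that estimate, as a sketch that assumes your observation moment is uniformly distributed over the phenomenon’s total lifespan: with 95% confidence you’re in the middle 95% of that lifespan, which works out to t_past/39 < t_future < 39 × t_past.)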
I read this as saying something like “This paper only makes sense if facts matter, separately from values.” It’s funny to me that this sentence felt necessary to write.
I mean, it’s more something like “there’s a shared way in which facts matter,” right? If I mostly think in terms of material consumption by individuals, and you mostly think in terms of human dignity and relationships, the way in which facts matter for both of us is only tenuously related.