Yeah, “built on lies” is far from a straightforward summary—it emphasises the importance of lies far beyond what you’ve argued for.
The system relies on widespread willingness to falsify records, and would (temporarily) grind to a halt if people were to simply refuse to lie.
The hospital system also relies on widespread willingness to take out the trash, and would (temporarily) grind to a halt if people were to simply refuse to dispose of trash. Does it mean that “the hospital system is built on trash disposal”? (Analogy mostly, but not entirely, serious).
everyone says Y and the system wouldn’t work without it, so it’s not reasonable to call it fraud.
This seems like a pretty reasonable argument against X being fraudulent. If X are making claims that everyone knows are false, then there’s no element of deception, which is important for (at least my layman’s understanding of) fraud. Compare: a sports fan proclaiming that their team is the greatest. Is this fraud?
On 1: How much time do people need to spend reading & arguing about coronavirus before they hit dramatically diminishing marginal returns? How many LW-ers have already reached that point?
On 3a: I’m pretty skeptical about marginal thought from people who aren’t specialists actually doing anything—unless you’re planning to organise tests or similar. What reason do you have to think LW posts will be useful?
On 3b: It feels like you could cross-apply this logic pretty straightforwardly to argue that LW should have a lot of political discussion; it has many of the same upsides, and also many of the same downsides. The very fact that LW has so much coronavirus coverage already demonstrates that the addictiveness of discussing this topic is comparable to that of politics.
I think LW has way too much coronavirus coverage. It was probably useful for us to marshal information when very few others were focusing on it. That was the “exam” component Raemon mentioned. Now, though, we’re stuck in a memetic trap where this high-profile event will massively distract us from things that really matter. I think we should treat this similarly to Slate Star Codex’s culture wars, because it seems to have a similar effect: recognise that our brains are built to overengage with this sort of topic, put it in an isolated thread, and quarantine it from the rest of the site as much as possible.
Paul is implicitly conditioning his actions on being in a world where there’s a decent amount of expected value left for his actions to affect. This is technically part of a decision procedure, rather than a statement about epistemic credences, but it’s confusing because he frames it as an epistemic credence.
Related: Jess Whittlestone’s PhD thesis, titled “The importance of making assumptions: why confirmation is not necessarily a bias.”
I realised that most of the findings commonly cited as evidence for confirmation bias were much less convincing than they first seemed. In large part, this was because the complex question of what it really means to say that something is a ‘bias’ or ‘irrational’ goes unacknowledged by most studies of confirmation bias. Often these studies don’t even state what standard of rationality they were claiming people were ‘irrational’ with respect to, or what better judgements might look like. I started to come across more and more papers suggesting that findings classically thought of as demonstrating a confirmation bias might actually be interpreted as rational under slightly different assumptions—and found that these papers often had much more convincing arguments, based on more thorough theories of rationality.
[I came to] conclusions I would not have expected myself to be sympathetic to a few years ago: that the extent to which our prior beliefs influence reasoning may well be adaptive across a range of scenarios given the various goals we are pursuing, and that it may not always be better to be ‘more open-minded’. It’s easy to say that people should be more willing to consider alternatives and less influenced by what they believe, but much harder to say how one does this. Being a total ‘blank slate’ with no assumptions or preconceptions is not a desirable or realistic starting point, and temporarily ‘setting aside’ one’s beliefs and assumptions whenever it would be useful to consider alternatives is incredibly cognitively demanding, if possible to do at all. There are tradeoffs we have to make, between the benefits of certainty and assumptions, and the benefits of having an ‘open mind’, that I had not acknowledged before.
Oh actually, I now see the explanation, from the same post, that this can arise when the gene causing male bias is itself on the Y-chromosome.
Segregation-distorters subvert the mechanisms that usually guarantee fairness of sexual reproduction. For example, there is a segregation-distorter on the male sex chromosome of some mice which causes only male children to be born, all carrying the segregation-distorter. Then these males impregnate females, who give birth to only male children, and so on. You might cry “This is cheating!” but that’s a human perspective; the reproductive fitness of this allele is extremely high, since it produces twice as many copies of itself in the succeeding generation as its nonmutant alternative. Even as females become rarer and rarer, males carrying this gene are no less likely to mate than any other male, and so the segregation-distorter remains twice as fit as its alternative allele. It’s speculated that real-world group selection may have played a role in keeping the frequency of this gene as low as it seems to be. In which case, if mice were to evolve the ability to fly and migrate for the winter, they would probably form a single reproductive population, and would evolve to extinction as the segregation-distorter evolved to fixation.
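The dynamics in the quoted passage can be sketched with a deterministic toy model (the starting numbers and the two-offspring assumption are mine, purely for illustration): driver males father only driver sons, normal males father sons and daughters in equal numbers, and every female mates with a random male and has two offspring.

```python
# One generation of a Y-linked segregation distorter (toy model).
def step(females, normal_males, driver_males, k=2.0):
    d = driver_males / (normal_males + driver_males)  # chance a mate is a driver
    daughters   = females * k * (1 - d) * 0.5
    normal_sons = females * k * (1 - d) * 0.5
    driver_sons = females * k * d  # driver males father only driver sons
    return daughters, normal_sons, driver_sons

f, m, dr = 1000.0, 990.0, 10.0  # driver starts rare: 1% of males
for _ in range(30):
    f, m, dr = step(f, m, dr)

# The driver sweeps toward fixation among males even as females (and so
# the whole population) collapse toward zero -- evolution to extinction.
print(dr / (m + dr) > 0.99, f < 1e-6)  # True True
```

Note that the driver's fitness advantage never shrinks in this model: a driver male's chance of mating is the same as any other male's, however skewed the sex ratio gets.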
+1, creating a self-reinforcing feedback loop =/= being an optimiser, and so I think any explanation of demons needs to focus on them making deliberate choices to reinforce themselves.
This can kick off an unstable feedback loop, e.g. a gene which biases toward male children can result in a more and more male-skewed population until the species dies out.
I’m suspicious of this mechanism; I’d think that as the number of males increases, there’s increasing selection pressure against this gene. Do you have a reference?
I think #3 could occur because of #2 (which I now mostly call “inner misalignment”), but it could also occur because of outer misalignment.
Broadly speaking, though, I think you’re right that #2 and #3 are different types of things. Because of that and other issues, I no longer think that this post disentangles the arguments satisfactorily; I’ll make a note of this at the top of the document.
I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it. To take a simple case, imagine that the two subnetworks each output a real number, and these are multiplied together to get a final output, which we can interpret as the agent’s reward (there’d need to be some further module which chooses behaviours in order to get that much reward, but let’s ignore that for now). Each subnetwork’s output measures how much that subnetwork thinks the agent’s original goal has been preserved. Suppose that normally both subnetworks output 1, and then they switch to outputting 0 when they think they’ve passed the threshold of corruption, which makes the agent get 0 reward.
I agree that, at this point, there’s no gradient signal to change the subnetworks. My points are that:
There’s still a gradient signal to change the OR gate (in this case, the implementation of multiplication).
Consider how they got to the point of outputting 0. They must have been decreasing from 1 as the overall network changed. So as the network changed, and they started producing outputs less than 1, there’d be pressure to modify them.
The point above isn’t true if the subnetworks go from 1 to 0 within one gradient step. In that case, the network will likely either bounce back and forth across the threshold (eroding the OR gate every time it does so) or else remain very close to the threshold (since there’s no penalty for doing so). But since the transition from 1 to 0 needs to be continuous at *some* resolution, staying very *very* close to the threshold will produce subnetwork output somewhere between 0 and 1, which creates pressure for the subnetworks to be less accurate.
4. It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain? (Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both). Ones which are so sensitive to your motivations that they can go from 1 to 0 within the space of a single gradient update?
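Point 1 can be made concrete with a minimal numerical sketch (my own toy construction, not anything from the post): once both subnetworks sit at 0, the partial derivative with respect to each subnetwork vanishes, but a parameter of the gate itself still receives gradient.

```python
# Toy version of the "two subnetworks feeding a multiplicative gate" setup.
# a and b are the subnetwork outputs; c is a learnable parameter of the
# gate itself, standing in for "the implementation of multiplication".
def reward(a, b, c):
    return a * b + c

# Central finite-difference estimate of the i-th partial derivative.
def partial(f, args, i, eps=1e-6):
    hi = list(args); hi[i] += eps
    lo = list(args); lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)

state = (0.0, 0.0, 0.0)  # both subnetworks have tripped to 0

# No gradient signal to change either subnetwork, since d(reward)/da = b = 0
# and d(reward)/db = a = 0 once both sit at zero:
print(partial(reward, state, 0))  # 0.0
print(partial(reward, state, 1))  # 0.0

# But the gate itself still gets a gradient: d(reward)/dc = 1, so gradient
# ascent on reward erodes the gate rather than the subnetworks.
print(partial(reward, state, 2))  # 1.0
```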
To be honest, this is all incredibly speculative, so please interpret all of the above with the disclaimer that it’s probably false or nonsensical for reasons I haven’t thought of yet.
An intuition I’m drawing on here: https://lamport.azurewebsites.net/pubs/buridan.pdf
In the section you quoted I’m talking about the case in which the extent to which the agent fails is fairly continuous. Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there’s a gradient signal to change the failing-hard mechanism.
I feel like the last sentence was a little insufficient but I’m pretty uncertain about how to think intuitively about this topic. The only thing I’m fairly confident about is that intuitions based on discrete functions are somewhat misleading.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don’t think this argument works. After the agent has made that commitment, it needs to set some threshold for the amount of goal shift that will cause it to fail hard. But until the agent hits that threshold, the gradient will continue to point in the direction of that threshold. And with a non-infinitesimal learning rate, you’ll eventually cross that threshold, and the agent will respond by failing hard.
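A one-dimensional caricature of this dynamic (entirely my own toy setup, with made-up numbers): suppose raw reward increases with the goal shift theta, but the agent commits to failing hard once theta crosses a threshold T. Below the threshold the gradient keeps pointing at it, and a non-infinitesimal learning rate eventually steps across.

```python
T = 1.0    # hypothetical corruption threshold the agent enforces
LR = 0.07  # non-infinitesimal learning rate

def reward(theta):
    # Raw reward grows with goal shift, but the agent fails hard past T.
    return theta if theta < T else 0.0

theta = 0.0
for _ in range(20):
    grad = 1.0 if theta < T else 0.0  # gradient below the threshold
    theta += LR * grad                # ascent keeps pointing at the threshold

print(theta >= T)     # True: a finite step eventually crosses it...
print(reward(theta))  # 0.0: ...and the agent responds by failing hard
```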
A possible counterargument: the agent’s ability to detect and enforce that threshold is not discrete, but also continuous, and so approaching the threshold will incur a penalty. But if that’s the case, then the gradients will point in the direction of removing the penalty by reducing the agent’s determination to fail upon detecting goal shift.
The way that this might still work is if modifications to this type of high-level commitment are harder to “detect” in partial derivatives than modifications to the underlying goals—e.g. if it’s hard to update away from the commitment without reducing the agent’s competence in other ways. And this seems kinda plausible, because high-level thought narrows down the space of outcomes sharply. But this is even more speculative.
I’ll try to respond properly later this week, but I like the point that embedded agency is about boundedness. Nevertheless, I think we probably disagree about how promising it is “to start with idealized rationality and try to drag it down to Earth rather than the other way around”. If the starting point is incoherent, then this approach doesn’t seem like it’ll go far—if AIXI isn’t useful to study, then probably AIXItl isn’t either (although take this particular example with a grain of salt, since I know almost nothing about AIXItl).
I appreciate that this isn’t an argument that I’ve made in a thorough or compelling way yet—I’m working on a post which does so.
Yeah, I should have been much more careful before throwing around words like “real”. See the long comment I just posted for more clarification, and in particular this paragraph:
I’m not trying to argue that concepts which we can’t formalise “aren’t real”, but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can’t formalise, and that it’s those incoherent extrapolations which “aren’t real” (I agree that this was quite unclear in the original post).
I like this review and think it was very helpful in understanding your (Abram’s) perspective, as well as highlighting some flaws in the original post, and ways that I’d been unclear in communicating my intuitions. In the rest of my comment I’ll try to write a synthesis of my intentions for the original post with your comments; I’d be interested in the extent to which you agree or disagree.
We can distinguish between two ways to understand a concept X. For lack of better terminology, I’ll call them “understanding how X functions” and “understanding the nature of X”. I conflated these in the original post in a confusing way.
For example, I’d say that studying how fitness functions would involve looking into the ways in which different components are important for the fitness of existing organisms (e.g. internal organs; circulatory systems; etc). Sometimes you can generalise that knowledge to organisms that don’t yet exist, or even prove things about those components (e.g. there’s probably useful maths connecting graph theory with optimal nerve wiring), but it’s still very grounded in concrete examples. If we thought that we should study how intelligence functions in a similar way as we study how fitness functions, that might look like a combination of cognitive science and machine learning.
By comparison, understanding the nature of X involves performing a conceptual reduction on X by coming up with a theory which is capable of describing X in a more precise or complete way. The pre-theoretic concept of fitness (if it even existed) might have been something like “the number and quality of an organism’s offspring”. Whereas the evolutionary notion of fitness is much more specific, and uses maths to link fitness with other concepts like allele frequency.
Momentum isn’t really a good example to illustrate this distinction, so perhaps we could use another concept from physics, like electricity. We can understand how electricity functions in a lawlike way by understanding the relationship between voltage, resistance and current in a circuit, and so on, even when we don’t know what electricity is. If we thought that we should study how intelligence functions in a similar way as the discoverers of electricity studied how it functions, that might involve doing theoretical RL research. But we also want to understand the nature of electricity (which turns out to be the flow of electrons). Using that knowledge, we can extend our theory of how electricity functions to cases which seem puzzling when we think in terms of voltage, current and resistance in circuits (even if we spend almost all our time still thinking in those terms in practice). This illustrates a more general point: you can understand a lot about how something functions without having a reductionist account of its nature—but not everything. And so in the long term, to understand really well how something functions, you need to understand its nature. (Perhaps understanding how CS algorithms work in practice, versus understanding the conceptual reduction of algorithms to Turing Machines, is another useful example).
I had previously thought that MIRI was trying to understand how intelligence functions. What I take from your review is that MIRI is first trying to understand the nature of intelligence. From this perspective, your earlier objection makes much more sense.
However, I still think that there are different ways you might go about understanding the nature of intelligence, and that “something kind of like rationality realism” might be a crux here (as you mention). One way that you might try to understand the nature of intelligence is by doing mathematical analysis of what happens in the limit of increasing intelligence. I interpret work on AIXI, logical inductors, and decision theory as falling into this category. This type of work feels analogous to some of Einstein’s thought experiments about the limit of increasing speed. Would it have worked for discovering evolution? That is, would starting with a pre-theoretic concept of fitness and doing mathematical analysis of its limiting cases (e.g. by thinking about organisms that lived for arbitrarily long, or had arbitrarily large numbers of children) have helped people come up with evolution? I’m not sure. There’s an argument that Malthus did something like this, by looking at long-term population dynamics. But you could also argue that the key insights leading up to the discovery of evolution were primarily inspired by specific observations about the organisms around us. And in fact, even knowing evolutionary theory, I don’t think that the extreme cases of fitness make sense. So I would say that I am not a realist about “perfect fitness”, even though the concept of fitness itself seems fine.
So an attempted rephrasing of the point I was originally trying to make, given this new terminology, is something like “if we succeed in finding a theory that tells us the nature of intelligence, it still won’t make much sense in the limit, which is the place where MIRI seems to be primarily studying it (with some exceptions, e.g. your Partial Agency sequence). Instead, the best way to get that theory is to study how intelligence functions.”
The reason I called it “rationality realism” not “intelligence realism” is that rationality has connotations of this limit or ideal existing, whereas intelligence doesn’t. You might say that X is very intelligent, and Y is more intelligent than X, without agreeing that perfect intelligence exists. Whereas when we talk about rationality, there’s usually an assumption that “perfect rationality” exists. I’m not trying to argue that concepts which we can’t formalise “aren’t real”, but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can’t formalise, and that it’s those incoherent extrapolations like “perfect fitness” which “aren’t real” (I agree that this was quite unclear in the original post).
My proposed redefinition:
The “intelligence is intelligible” hypothesis is about how lawlike the best description of how intelligence functions will turn out to be.
The “realism about rationality” hypothesis is about how well-defined intelligence is in the limit (where I think of the limit of intelligence as “perfect rationality”, and “well-defined” with respect not to our current understanding, but rather with respect to the best understanding of the nature of intelligence we’ll ever discover).
Cool, thanks for those clarifications :) In case it didn’t come through from the previous comments, I wanted to make clear that this seems like exciting work and I’m looking forward to hearing how follow-ups go.
Yes, but the fact that the fragile worlds are much more likely to end in the future is a reason to condition your efforts on being in a robust world.
While I do buy Paul’s argument, I think it’d be very helpful if the various summaries of the interviews with him were edited to make it clear that he’s talking about value-conditioned probabilities rather than unconditional probabilities—since the claim as originally stated feels misleading. (Even if some decision theories only use the former, most people think in terms of the latter).
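As a concrete illustration of the difference (with entirely made-up numbers): an agent can assign low unconditional credence to being in a robust world, yet place most of its decision weight there, because that’s where its effort has leverage.

```python
# Made-up credences and marginal value of effort in each world.
worlds = {
    "fragile": {"p": 0.7, "value_of_effort": 0.05},
    "robust":  {"p": 0.3, "value_of_effort": 1.00},
}

# Unconditional credence in the robust world:
p_robust = worlds["robust"]["p"]
print(p_robust)  # 0.3

# Share of expected impact coming from the robust world -- the weight a
# value-conditioned decision procedure effectively acts on:
total = sum(w["p"] * w["value_of_effort"] for w in worlds.values())
weight_robust = worlds["robust"]["p"] * worlds["robust"]["value_of_effort"] / total
print(weight_robust > 0.85)  # True: ~0.9 decision weight from 0.3 credence
```

Reporting the ~0.9 figure as if it were a credence is exactly the kind of statement that reads as misleading without the conditioning made explicit.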
Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people’s abstractions. So while it seems reasonably likely to me that an AI “naturally thinks” in terms of the same low-level abstractions as humans, it thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.
I have four concerns even given that you’re using a proper scoring rule, which relate to the link between that scoring rule and actually giving people money. I’m not particularly well-informed on this though, so could be totally wrong.
1. To implement some proper scoring rules, you need the ability to confiscate money from people who predict badly. Even when the score always has the same sign, like you have with log-scoring (or when you add a constant to a quadratic scoring system), if you don’t confiscate money for bad predictions, then you’re basically just giving money to people for signing up, which makes having an open platform tricky.
2. Even if you restrict signups, you get an analogous problem within a fixed population that’s already signed up: the incentives will be skewed when it comes to choosing which questions to answer. In particular, if people expect to get positive amounts of money for answering randomly, they’ll do so even when they have no relevant information, adding a lot of noise.
3. If a scoring rule is “very capped”, as the log-scoring function is, then the expected reward from answering randomly may be very close to the expected reward from putting in a lot of effort, and so people would be incentivised to answer randomly and spend their time on other things.
4. Relatedly, people’s utilities aren’t linear in money, so the score function might not remain a proper one taking that into account. But I don’t think this would be a big effect on the scales this is likely to operate on.
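Points 1 and 2 can be made concrete with a quadratic (Brier-style) rule shifted so payouts are never negative; the numbers below are my own toy illustration, not anything from a particular platform.

```python
# Shifted quadratic score: pay(q, outcome) = 1 - (q - outcome)^2, in [0, 1].
# Never negative, so no money ever needs to be confiscated...
def pay(q, outcome):
    return 1 - (q - outcome) ** 2

def expected_pay(p, q):
    # Expected payout for reporting q when the true probability is p.
    return p * pay(q, 1) + (1 - p) * pay(q, 0)

# The rule is still proper: reporting your true belief maximises expected pay.
p = 0.7
grid = [i / 100 for i in range(101)]
print(max(grid, key=lambda q: expected_pay(p, q)))  # 0.7

# ...but a know-nothing forecaster who always reports q = 0.5 expects
# 1 - 0.25 = 0.75 per question, whatever the true probability is, so
# answering every question at random is free money (points 1 and 2 above).
print(expected_pay(0.2, 0.5), expected_pay(0.9, 0.5))
```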