And I mean the word “maybe” in the above sentence. I am saying the sentence not to express any disagreement, but to play with a conjecture that I am curious about.
Anyway, my reaction to the actual post is:
“Yep, Overconfidence is Deceit. Deceit is bad.”
However, reading your post made me think about how maybe your right to not be deceived is trumped by my right to be incorrect.
I believe that I could not pass your ITT. I believe I am projecting some views onto you, in order to engage with them in my head (and publicly so you can engage if you want). I guess I have a Duncan-model that I am responding to here, but I am not treating that Duncan-model as particularly truth-tracking. It is close enough that it makes sense (to me) to call it a Duncan-model, but its primary purpose in me is not for predicting Duncan, but rather for being there to engage with on various topics.
I suspect that being a better model would help it serve this purpose, and would like to make it better, but I am not requesting that.
I notice that I used different words in my header, “Scott’s model of Duncan’s beliefs.” I think that this reveals something, but it certainly isn’t clear: “beliefs” are for true things, “models” are toys for generating things.
I think that in my culture, having a not-that-truth-tracking Duncan-model that I want to engage my ideas with is a sign of respect. I think I don’t do that with that many people (more than 10, but less than 50, I think). I also do it with a bunch of concepts, like “Simic,” or “Logical Induction.” The best models according to me are not the ones that are the most accurate, as much as the ones that are most generally applicable. Rounding off the model makes it fit in more places.
However, I can imagine that maybe in your culture it is something like objectification, which causes you to not be taken seriously. Is this true?
If you are curious about what kind of things my Duncan-model says, I might be able to help you build a (Scott’s-Duncan-Model)-Model. In one short phrase, I think I often round you off as an avatar of “respect,” but even my bad model has more nuance than just the word “respect”.
I imagine that you are imagining my comment as a minor libel about you, by contributing to a shared narrative in which you are something that you are not. I am sad to the extent that it has that effect. I am not sure what to do about that. (I could send things like this in private messages, that might help).
However, I want to point out that I am often not asking people to update from my claims. That is often an unfortunate side effect. I want to play with my Duncan-model. I want you to see what I build with it, and point out where it is not correctly tracking what Duncan would actually say. (If that is something you want.) I also want to do this in a social context. I want my model to be correct, so that I can learn more from it, but I want to relinquish any responsibility for it being correct. (I am up for being convinced that I should take on that responsibility, either as a general principle, or as a cooperative action towards you.)
Feel free to engage or not.
PS: The above is very much responding to my Duncan-model, rather than what you are actually saying. I reread your above comment, and my comment, and it seems like I am not responding to you at all. I still wanted to share the above text with you.
Yep, I totally agree that it is a riff. I think that I would have put it in response to the poll about how important it is for karma to track truth, if not for the fact that I don’t like to post on Facebook.
This comment is not only about this post, but is also a response to Scott’s model of Duncan’s beliefs about how epistemic communities work, and a couple of Duncan’s recent Facebook posts. It is also a mostly unedited rant. Sorry.
I grant that overconfidence is in a similar reference class as saying false things. (I think there is still a distinction worth making, similar to the difference between lying directly and trying to mislead by saying true things, but I am not really talking about that distinction here.)
I think society needs to be robust to people saying false things, and thus have mechanisms that prevent those false things from becoming widely believed. I think that as little as possible of that responsibility should be placed on the person saying the false things, in order to make it more strategy-proof. (I think that it is also useful for the speaker to help by trying not to say false things, but I am putting more of the responsibility on the listener.)
I think there should be pockets of society (e.g. collections of people, specific contexts or events) that can collect true beliefs and reliably and significantly decrease the extent to which they put trust in the claims of people who say false things. Call such contexts “rigorous.”
I think that it is important that people look to the output of these rigorous contexts when e.g. deciding on COVID policy.
I think it is extremely important that the rigorous pockets of society are not “everyone in all contexts.”
I think that society is very much lacking reliable rigorous pockets.
I have this model where in a healthy society, there can be contexts where people generate all sorts of false beliefs, but also sometimes generate gold (e.g. new ontologies that can vastly improve the collective map). If this context is generating a sufficient supply of gold, you DO NOT go in and punish their false beliefs. Instead, you quarantine them. You put up a bunch of signs that point to them and say e.g. “80% boring true beliefs 19% crap 1% gold,” then you have your rigorous pockets watch them, and try to learn how to efficiently distinguish between the gold and the crap, and maybe see if they can generate the gold without the crap. However sometimes they will fail and will just have to keep digging through the crap to find the gold.
One might look at lesswrong, and say “We are trying to be rigorous here. Let’s push stronger on the gradient of throwing out all the crap.” I can see that. I want to be able to say that. I look at the world, and I see all the crap, and I want there to be a good pocket that can be about “true=good”, “false=bad”, and there isn’t one. Science can’t do it, and maybe lesswrong can.
Unfortunately, I also look at the world and see a bunch of boring processes that are never going to find gold. Science can’t do it, and maybe lesswrong can.
And, maybe there is no tradeoff here. Maybe it can do both. Maybe at our current level of skills, we find more gold in the long run by being better at throwing out the crap.
I don’t know what I believe about how much tradeoff there is. I am writing this, and I am not trying to evaluate the claims. I am imagining inhabiting the world where there is a huge tradeoff. Imagining the world where lesswrong is the closest thing we have to being able to have a rigorous pocket of society, but we have to compromise, because we need a generative pocket of society even more. I am overconfidently imagining lesswrong as better than it is at both tasks, so that the tradeoff feels more real, and I am imagining the world failing to pick up the slack of whichever one it lets slide. I am crying a little bit.
And I am afraid. I am afraid of being the person who overconfidently says “We need less rigor,” and sends everyone down the wrong path. I am also afraid of the person who overconfidently says “We need less rigor,” and gets flagged as a person who says false things. I am not afraid of saying “We need more rigor.” The fact that I am not afraid of saying “We need more rigor” scares me. I think it makes me feel that if I look too closely, I will conclude that “We need more rigor” is true. Specifically, I am afraid of concluding that and being wrong.
In my own head, I have a part of me that is inhabiting the world where there is a large tradeoff, and we need less rigor. I have another part that is trying to believe true things. The second part is making space for the first part, and letting it be as overconfident as it wants. But it is also quarantining the first part. It is not making the claim that we need more space and less rigor. This quarantine action has two positive effects. It helps the second part have good beliefs, but it also protects the first part from having to engage with the hammer of truth until it has grown.
I conjecture that to the extent that I am good at generating ideas, it is partially because I quarantine, but do not squash, my crazy ideas. (Where ignoring the crazy ideas counts as squashing them.) I conjecture further that an ideal society needs to do similar motions at the group level, not just the individual level. I said at the beginning that you need to put the responsibility for distinguishing in the listener for strategy-proofness. This was not the complete story. I conjecture that you need to put the responsibility in the hands of the listener, because you need to have generators that are not worried about accidentally having false/overconfident beliefs. You are not supposed to put policy decisions in the hands of the people/contexts that are not worried about having false beliefs, but you are supposed to keep giving them attention, as long as they keep occasionally generating gold.
Personal Note: If you have the attention for it, I ask that anyone who sometimes listens to me keeps (at least) two separate buckets: one for “Does Scott sometimes say false things?” and one for “Does Scott sometimes generate good ideas?”, and decide whether to give me attention based on these two separate scores. If you don’t have the attention for that, I’d rather you just keep the second bucket. I concede the first bucket (for now), and think my comparative advantage is to be judged according to the second one, and never be trusted as epistemically sound. (I don’t think I am horrible at being epistemically sound, at least in some domains, but if I only get a one dimensional score, I’d rather relinquish the right to be epistemically trusted, in order to absolve myself of the responsibility to not share false beliefs, so my generative parts can share more freely.)
Unedited stream of thought:
Before trying to answer the question, I’m just gonna say a bunch of things that might not make sense (either because I am being unclear or being stupid).
So, I think the debate example is much more *about* manipulation than the iterated amplification example, so I was largely replying to the class that includes IA and debate. I can imagine saying that iterated amplification done right does not provide an incentive to manipulate the human.
I think that a process that was optimizing directly for finding a fixed point of X=AmplifyH(X) does have an incentive to manipulate the human. However, this is not exactly what IA is doing, because it is only having the gradients pass through the first X in the fixed point equation, and I can imagine arguing that the incentive to manipulate comes from having the gradient pass through the second X. If you iterate enough times, I think you might effectively have some optimization juice passing through modifying the second X, but it might be much less. I am confused about how to think about optimization towards a moving target being different from optimization towards finding a fixed point.
I think that even if you only look at the effect of following the gradients coming from the effect of changing the first X, you are at least providing an incentive to predict the human on a wide range of inputs. In some cases, your range of inputs might be such that there isn’t actually information about the human in the answers, which I think is where you are trying to get with the automated decomposition strategies. If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.
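To make the first-X versus second-X distinction in X=AmplifyH(X) concrete, here is a toy sketch. Everything in it is an assumption made up for illustration, not anything from the IA experiments: X is reduced to a scalar answer `x`, `amplify` is a hypothetical linear stand-in for AmplifyH, and `m` is a hypothetical "manipulation" knob that shifts the target without being part of the answer. The loss is the squared gap between the first X and the target computed from the second X.

```python
# Toy constants, chosen arbitrarily for illustration.
A, B, C = 0.5, 1.0, 1.0


def amplify(x, m):
    """Hypothetical stand-in for AmplifyH: the target depends on the
    current answer x and on a manipulation knob m."""
    return A * x + B + C * m


def train(full_gradient, steps=200, lr=0.1):
    """Minimize 0.5 * (x - amplify(x, m))**2 by gradient descent.

    full_gradient=False: gradients flow only through the first X
    (the target is treated as a constant at each step).
    full_gradient=True: gradients also flow through the second X,
    i.e. through the target itself.
    """
    x, m = 0.0, 0.0
    for _ in range(steps):
        err = x - amplify(x, m)
        grad_x = err   # contribution from the first X
        grad_m = 0.0   # m never appears in the first X
        if full_gradient:
            grad_x += err * (-A)  # d(target)/dx
            grad_m += err * (-C)  # d(target)/dm
        x -= lr * grad_x
        m -= lr * grad_m
    return x, m


x_semi, m_semi = train(full_gradient=False)
x_full, m_full = train(full_gradient=True)
print("first X only:", x_semi, m_semi)
print("both Xs:     ", x_full, m_full)
```

In this toy, the first-X-only run leaves `m` at exactly 0.0 and moves `x` towards the fixed point B/(1-A) = 2, while the full-gradient run also pushes `m` away from zero (to about -0.8 with these constants), so it closes the gap partly by moving the target. It says nothing about the "iterate enough times" worry or about the incentive to predict, both of which are outside what a two-parameter toy can show.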
On the other hand, maybe we just think it is okay to predict, but not manipulate, the humans, while they are answering questions with a lot of common information about humans’ work, which is what I think IA is supposed to be doing. In this case, even if I were to say that there is no incentive to “manipulate the human”, I still argue that there is “incentive to learn how to manipulate the human,” because predicting the human (on a wide range of inputs) is a very similar task to manipulating the human.
Okay, now I’ll try to answer the question. I don’t understand the question. I assume you are talking about incentive to manipulate in the simple examples with permutations etc. in the experiments. I think there is no ability to manipulate those processes, and thus no gradient signal towards manipulation of the automated process. I still feel like there is some weird counterfactual incentive to manipulate the process, but I don’t know how to say what that means, and I agree that it does not affect what actually happens in the system.
I agree that changing to a human will not change anything (except via also adding the change where the system is told (or can deduce) that it is interacting with the human, and thus ignores the gradient signal, to do some treacherous turn). Anyway, in these worlds, we likely already lost, and I am not focusing on them. I think the short answer to your question is in practice no, there is no difference, and there isn’t even incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.
I think that there are two paths of crux opportunities for me to go down here, and I’m sure we could find more: 1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn’t count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.
BTW, I would be down for something like a facilitated double crux on this topic, possibly in the form of a weekly LW meetup. (But I think it would be a mistake to stop talking about it in this thread just to save it for that.)
I am having a hard time generating any ontology that says:
I don’t see [let’s try to avoid giving models strong incentives to learn how to manipulate humans] as particularly opposed to methods like iterated amplification or debate.
Here are some guesses:
You are distinguishing between an incentive to manipulate real life humans and an incentive to manipulate human models?
You are claiming that the point of e.g. debate is that when you do it right there is no incentive to manipulate?
You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don’t count?
These are just guesses.
(Where by “my true reason”, I mean what feels live to me right now. There is also all the other stuff from the post, and the neglectedness argument.)
Yeah, looking at this again, I notice that the post probably failed to communicate my true reason. I think my true reason is something like:
I think that drawing a boundary around good and bad behavior is very hard. Luckily, we don’t have to draw a boundary between good and bad behavior; we need to draw a boundary that has bad behavior on the outside, and *enough* good behavior on the inside to bootstrap something that can get us through the X-risk. Any distinction between good and bad behavior with any nuance seems very hard to me. However, the boundary of “just think about how to make (better transistors and scanning tech and quantum computers or whatever) and don’t even start thinking about (humans or agency or the world or whatever)” seems like it might be carving reality at the joints enough for us to be able to watch a system that is much stronger than us and know that it is not misbehaving.
i.e., I think my true reason is not that all reasoning about humans is dangerous, but that it seems very difficult to separate out safe reasoning about humans from dangerous reasoning about humans, and it seems more possible to separate out dangerous reasoning about humans from some sufficiently powerful subset of safe reasoning (but it seems likely that that subset needs to have humans on the outside).
Yeah, I am sad, but not surprised, because I have been trying to push this idea (e.g. at conferences) for a few years.
Guesses as to why I’m failing?
I think that we actually undersold the neglectedness point in this post, but I don’t think that is the main problem. I think the main problem is that the post (and I) do not give viable alternatives; it’s like:
“Have you noticed that the CHAI ontology, the Paul ontology, and basically all the concrete plans for safe AGI are trying to get safety out of superhuman models of humans, and there are plausible worlds in which this is on net actively harmful for safety? Further, the few exceptions to this mostly involve AGI interacting directly with the world agentically in such a way as to create an instrumental incentive for human modeling.”
“Okay, what do we do instead?”
Perhaps it would go better if I gave any concrete plan at all, even if it is unrealistic, like:
1. Understand agency/intelligence/optimization/corrigibility to the point where we could do something good and safe if we had unlimited compute, and maybe reliable brain scans.
2. (in parallel) Build safe enough AGI that can do science/engineering to the point of being able to generate plans to turn Jupiter into a computer, without relying on human models at all.
3. Turn Jupiter into a computer.
4. Do the good and safe thing on the Jupiter computer, or if no better ideas are found, run a literal HCH on the Jupiter computer.
The problem is that this draws attention to how unrealistic the plan is, and not to the open question of “What *do* we do instead?”
I am confused, why is it not identical to your other comment?
I think I fixed it. Thanks.
This was annoying to fix, so I just made W nonempty in the intro to the post.
Yep, changed it to ≃.
Fixed at least some of them.