Chris van Merwijk

Karma: 730

Chris van Merwijk Apr 23, 2022, 11:30 AM
LW: 10 AF: 1
0
AF
on: Optimal play in human-judged Debate usually won’t answer your question
Reading this post a while after it was written: I’m not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seems suspicious to me. Here are some points:
- In my model of the standard debate setup with human judge, the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer. The fact that one answer provides more useful information than “2+2=?” doesn’t imply a “direct” incentive for the human judge to select that as the correct answer. Upon introspection, I myself would probably say that “4” is the correct answer, while still being very interested in the other answer (the answer on AI risk). I don’t think you disagreed with this?
- At a later point you say that the real reason for why the judge would nevertheless select the QIA as the correct answer is that the judge wants to train the system to do useful things. You seem to say that a rational consequentialist would make this decision. Then at a later point you say that this is probably/plausibly (?) a bad thing: “Is this definitely undesirable? I’m not sure, but probably”. But if it really is a bad thing and we can know this, then surely a rational judge would know this, and could just decide not to do it? If you were the judge, would you select the QIA, despite it being “probably undesirable”?
- Given that we are talking about optimal play and the human judge is in fact not rational/safe, the debater could manipulate the judge, and so the previous argument doesn’t in fact imply that judges won’t select QIA’s. The debater could deceive and manipulate the judge into (incorrectly) thinking that it should select the QIA, even if you/we currently believe that this would be bad. I agree this kind of deception would probably happen in optimal play (if that is indeed what you meant), but it relies on the judge being irrational or manipulable, not on some argument that “it is rational for a consequentialist judge to select answers with the highest information value”.
It seems to me that either we think there is no problem with selecting QIA’s as answers, or we think that human judges will be irrational and manipulated, but I don’t see the justification in this post for saying “rational consequentialist judges will select QIA’s AND this is probably bad”.

Chris van Merwijk Apr 6, 2022, 9:11 PM
1 point
0
in reply to: TAG’s comment on: A paradox of existence
yes, but I think your reasoning “If 2 is only talking about the map, it doesn’t imply 3” is too vague. I’d rather not go into it though, because I am currently busy with other things, so I’d suggest letting the reader decide.
Edit: reading back my response, it might come accross as a bit rude. If so, sorry for that, I didn’t mean it that way.

Chris van Merwijk Apr 6, 2022, 8:59 PM
1 point
0
in reply to: TAG’s comment on: A paradox of existence
I think this is too vague, but I will drop this discussion and let the reader decide.

Chris van Merwijk Apr 6, 2022, 8:30 PM
2 points
0
in reply to: TAG’s comment on: A paradox of existence
“But without the premise that the territory is maths, the rest of the paradox doesn’t follow.”
I explicitly said “mathematically describable” implying I am not identifying the theory with reality. Nothing in my “argument” makes this identification

Chris van Merwijk Apr 5, 2022, 7:45 PM
2 points
0
in reply to: Yair Halberstadt’s comment on: A paradox of existence
If an object knows that it exists, then this implies that it actually exists. Moreover, assuming that the state of a brain is a mathematical fact about the mathematical theory, then that the object knows it exists is in principle a mathematical implication of the mathematical theory (if observation 2 is correct). Hence it would be an implication of the theory that that theory describes an existing reality.

Chris van Merwijk Apr 5, 2022, 3:52 PM
3 points
0
in reply to: romeostevensit’s comment on: A paradox of existence
Basically, yes.

Chris van Merwijk Apr 5, 2022, 12:34 PM
1 point
0
in reply to: Rafael Harth’s comment on: A paradox of existence
“There may also be mathematical properties that are universe-specific (the best candidates here are natural constants), but the extent to which these exist is questionable”

The exact position of every atom in the universe at time t=10^10 years is a “mathematical property of our universe” in my terminology. The fact that some human somewhere uttered the words “good morning” at some point today, is a complicated mathematical property of our universe, in principle derivable from the fundamental theory of physics.

Chris van Merwijk Mar 28, 2022, 12:25 PM
2 points
0
in reply to: sanxiyn’s comment on: Manhattan project for aligned AI
tangential comment: Regarding “I will define success as producing fission weapons before the end of war in Europe”. I’m not sure if this is the right criterion for success for the purpose of analogizing to AGI. It seems to me that “producing fission weapons before an Axis power does” is more appropriate.

And this seems overwhelmingly the case, yes: “theory of atomic bomb was considerably more advanced at the beginning of Manhattan project compared to our understanding of theory of aligned AGI”

Chris van Merwijk Mar 28, 2022, 5:17 AM
1 point
0
in reply to: Yitz’s comment on: Manhattan project for aligned AI
I’m not sure I understand the motivation behind question. How much of my modern knowledge am I supposed to throw away? Note I am not in fact an atomic theorist who has the state of knowledge of atomic theory in 1942 so it’s hard to know what I’d think, but I can imagine assigning somewhere between 5% and 95% depending on how informed of an atomic theorist I actually was and what it was actually like in 1942. Maybe I could give a better answer if you clarify the motivation behind the question?

Chris van Merwijk Mar 21, 2022, 4:28 PM
3 points
0
in reply to: Steven Byrnes’s comment on: Natural Value Learning
“I tend to think that learning and following the norms of a particular culture (further discussion) isn’t too hard a problem for an AGI which is motivated to do so”. If the AGI is motivated to do so then the value learning problem is already solved and nothing else matters (in particular my post becomes irrelevant), because indeed it can learn the further details in whichever way it wants. We somehow already managed to create an agent with an internal objective that points to Bedouin culture (human values), which is the whole/complete problem.

I could say more about the rest of your comment but just checking if the above changes your model of my model significantly?

Also regarding “I think I’m much more open-minded than you to …”: to be clear, I’m not at all convinced about this I’m open to this distinction not mattering at all. I hope I didn’t come accross as not open minded about this.

Chris van Merwijk Mar 21, 2022, 7:41 AM
3 points
0
in reply to: Steven Byrnes’s comment on: Natural Value Learning
Not really a fair characterization I think: 2 mostly seems orthogonal to me (though I probably disagree with your claim. i.e. most important things are passed from previous generations. e.g. children learn that theft is bad, racism is bad etc, all of those things are passed from either parents or other adults. I don’t care a lot about the distinction parents vs other adults/society in this case. I know about the research that parenting has little influence, I don’t want to go into it preferably). 1 seems more relevant. In fact maybe the main reason for me to think this post is irrelevant is that the inductive biases in AI systems will be too different from that of humans (although note, genes still alow for a lot of variability in ethics and so on). But I still think it might be a good idea to keep in mind that “information in the brain about values has a higher risk to not get communicated into the training signal if the method of elliciting that information is not adapted to the way humans normally express the information”, if indeed it is true.

Chris van Merwijk Mar 21, 2022, 7:28 AM
3 points
0
in reply to: Charlie Steiner’s comment on: Natural Value Learning
I haven’t specified anything about the algorithms, but they will maybe somehow have to be different. The point is that the format of the human feedback is different. Really this post is about the format in which humans provide feedback rather than about the structure of the AI systems (i.e. a difference in method of generating the training signal rather than a difference in learning algorithm).

Chris van Merwijk Mar 21, 2022, 7:28 AM
3 points
0
in reply to: Dagon’s comment on: Natural Value Learning
The thing underlying the intuition is more something like: We have a method of feedback that humans understand and that works fairly well, and is adapted to the way values are stored in human brains. If we try to have humans give feedback in ways that are not adapted to that, I expect information to be lost. The fact that it “feels natural” is a proxy for “the method of feedback to machines is adapted to the way humans normally give feedback to other humans” without which I am at least concerned about information loss (not claiming it’s inevitable). I don’t inherently care about the “feeling” of naturalness.
Regarding no Safe Natural Intelligence: I agree that there is no such thing, but this is not really a strong argument against? This doesn’t make me somehow suddenly feel comfortable about “unnatural” (I need a better term) methods for humans to provide feedback to AI agents. The fact that there are bad people doesn’t negate the idea that the only source of information about what is good seems to be stored in brains and that we need to extract this information in a way that is adapted to how those brains normally express that information.
Maybe I should have called it “human-adapted methods of human feedback” or something.

Chris van Merwijk Oct 24, 2021, 10:15 AM
18 points
0
on: Countable Factored Spaces
I suggest renaming this to “countably factored spaces”. Countably being a property of the factorization rather than the space.

Also I suggest adding an actual self-contained definition of countable factored space to make it more readable.

Chris van Merwijk Sep 28, 2021, 10:04 AM
2 points
0
in reply to: VojtaKovarik’s comment on: Moloch games
Maybe. I actually don’t think the term “Moloch” is very important. What I think is important is getting a good conceptual understanding of the behavioural notion of “what society wants”, behavioural in the sense that it is independent of idealized notions of what would be good or what individuals imagine society wants but depends on how the collection of agents behaves/is incentivized to behave. I view the fact that this ends up deviating from what would be good for the sum of utilities, as essentially the motivation for this topic, but not the core conceptual problem. So I’d want to nudge people who want to clarify “Molochs” to focus mostly on conceptually clarifying (1) and only secondarily on clarifying (2).

Secondarily, just to push back against your point that “Moloch” is historically more connotated with (2). This is sort of true, but on the other hand, what does the concept of “Moloch” add to our conceptual toolbox, above and beyond the bag of more standard concepts like “collective action problem” and “externalities” and so forth? I’d say that it is already well-understood that collections of individuals can end up interacting in ways that is globally pareto-suboptimal. I think the additions to this analysis made in SSC are something like: conceptualizing various processes as optimizing in a certain direction/looking at the system-level for optimization processes. The core point to get clarity on here is I think (1), and then (2) should fall out of that.

Chris van Merwijk Sep 28, 2021, 9:23 AM
2 points
0
in reply to: VojtaKovarik’s comment on: Moloch games
Yeah I suppose that you’re taking an essential property of a Moloch to be that it wants something other than the sum of utilities. That’s a reasonable terminological condition I suppose, but I’m addressing the question of “what does it even mean for ‘society’ to want anything at all?” Then whatever that is, it might be that (e.g. by some coincidence, or by good coordination mechanisms, or because everyone wants the same thing) what society wants is the same as what would be good for the sum of individual utilities. It seems to me that the question of “what does society want?” is more fundamental than “how does that which society want deviate from what would be good for its individuals?”

Chris van Merwijk Aug 24, 2021, 8:08 AM
1 point
0
AF
on: Finite Factored Sets: Conditional Orthogonality
I think a subpartition of S can be thought of as a partial function on S, or equivalently, a variable on S that has the possible value “Null”/”undefined”.

Chris van Merwijk Aug 23, 2021, 1:42 PM
4 points
0
on: Finite Factored Sets: Orthogonality and Time
I just want to point out some interesting properties of this definition of time: Let time_C refer to the classical notion of time in a dynamical system, and time_FFS the notion defined in this article.

1. Suppose we have a field on space-time generated by a typical differential dynamical law that satisfies time_C-reversal symmetry, and suppose we factorize its histories according to the states of the system at time_C t=0. Then time_FFS doesn’t make a distinction between the “positive” and “negative” part of the time_C. That is, if x is some position (choose a reference frame), then the position (x,2) in space-time (i.e. the value of the field at position x at time_C 2) is later in time_FFS than (x,1), but (x,-2) is also later in time than (x,-1). In this sense, the time_FFS notion of time seems to naturally capture the time-reversal symmetry in the laws of physics: Intuitively, if we start at the big bang, and go “backward in time” we are just as much going into the future as we are if we would go “forward in time”. Both directions are the future.

2. However, more weirdly, time_FFS also allows a comparison between the negative-time_C and positive-time_C events. Namely, (x,1) happens before_FFS (x,-2) while (x, −1) happens before_FFS (x,2). I am not sure what to make of this, or whether we should make anything of it.

3. Suppose a computer is implemented in the physical world and implements a deterministic function $f : X \to Y$ , AND we restrict to the set of histories in which this computer actually does this computation. Now let x denote the variable that captures what input is given to this computer (meaning, the data stored in the input register at one particular instance of running this algorithm), and y similarly denote the variable that captures what the output is, then y occurs (weakly) earlier_FFS than x, even though the variable x is defined to be earlier than y (more precisely, to directly apply the definitions of x and y to check their value in a particular history h would involve doing a check for x at a time_C that is earlier_C than the check for y). I’m not sure what to make of this, though it kind of seems like a feature not a bug. If we don’t restrict to the set of histories in which the computer does the computation, I’m pretty sure this result disappears, which makes me think this is actually a desirable property of the theory.

Chris van Merwijk Aug 23, 2021, 1:19 PM
1 point
0
on: Finite Factored Sets: Orthogonality and Time
In the proof of proposition 18, “part 3” should be “part 4″.

Chris van Merwijk Aug 23, 2021, 10:28 AM
1 point
0
AF
on: Finite Factored Sets: Orthogonality and Time
Can’t you define $C ⊢^{S} X$ for any set $C$ of partitions of $X$ , rather than $C ⊢^{F} X$ w.r.t. a specific factorization $F$ , simply as $C ⊢^{S} X$ iff $⋁_{S} (C) \geq_{S} X$ ? If so, it would seem to me to be clearer to define $⊢$ that way (i.e. make 7 rather than 2 from proposition 10 the definition), and then basically proposition 10 says “if $C$ is a subset of factors of a partition then here are a set of equivalent definitions in terms of chimera”. Also I would guess that proposition 11 is still true for $⊢^{S}$ rather than just for $⊢^{F}$ , though I haven’t checked that 11.6 would still work, but it seems like it should.