Mathematician turned alignment researcher. Probably happy to chat about math, current ML, or long-term AI thoughts.
The basics—Nathaniel Monson (nmonson1.github.io)
“You cannot argue with a group. You cannot convince a group of things or change a group’s mind.”
Forgive me if this comes across as trollish, but whose mind are you trying to change with this essay?
To me it seems like your point is either self-refuting (in form, if not meaning) or, at best, incomplete.
I’m looking forward to both this series, and the workshop!
I think I (and probably many other people) would find it helpful if this sequence included an entry that was purely the classical story, told in a way/with language that makes its deficiencies clear and the contrasts with the Watanabe version easy to point out. (Maybe a −1 entry, since 0 is already used?)
As someone who successfully first-tried the ball into the cup without any video analysis, my algorithm was:
1) ask to see the ball roll down the ramp, but have it stopped at the end
2) notice the ramp moving with significant flex
3) do the standard calculations for ball assuming all potential is converted to kinetic+rolling, and calculate cup-lip placement accordingly
4) decide that “about 10-15% loss” both sounded right as compensation for the flex and looked good to my physics instincts, and so move the cup closer accordingly.
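For concreteness, step 3's energy-balance calculation can be sketched as below. The ramp height, table height, and loss fraction are made-up example numbers (the comment doesn't give the real ones), and the ball is assumed to be a uniform solid sphere:

```python
import math

G = 9.81  # m/s^2

def launch_speed(ramp_height_m: float, energy_loss_frac: float = 0.0) -> float:
    """Speed at the ramp's end, assuming all potential energy becomes
    translational + rolling kinetic energy of a uniform solid sphere:
    m*g*h = (1/2)*m*v^2 + (1/2)*I*w^2, with I = (2/5)*m*r^2 and w = v/r,
    giving v = sqrt(10*g*h/7). `energy_loss_frac` models losses
    (e.g. ramp flex) as a fraction of the initial potential energy."""
    return math.sqrt(10 * G * ramp_height_m * (1 - energy_loss_frac) / 7)

def cup_distance(ramp_height_m: float, table_height_m: float,
                 energy_loss_frac: float = 0.0) -> float:
    """Horizontal distance from the table edge at which the ball lands,
    treating it as a projectile launched horizontally."""
    v = launch_speed(ramp_height_m, energy_loss_frac)
    t_fall = math.sqrt(2 * table_height_m / G)
    return v * t_fall

# Example numbers (invented for illustration): 30 cm drop, 1 m table, 12% loss.
print(f"no loss:  {cup_distance(0.3, 1.0):.3f} m")
print(f"12% loss: {cup_distance(0.3, 1.0, 0.12):.3f} m")
```

The loss fraction moves the predicted landing point closer to the table, which is the direction step 4 adjusts the cup.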
It was a fun exercise! Thanks, John :)
This is a really cool piece of work!
Succinctly, if someone runs into an omega which says “I will give you $1,000,000 if you are someone who would have two-boxed in Newcomb. If you would have one-boxed, I will kill your family”, then the two-boxers have much better outcomes than the one-boxers. You may object that this seems silly and artificial. I think it is no more so than the original problem.
And yes—I think EY is very wrong in the post you link to, and this is a response to the consensus LW view that one-boxing is correct.
The point of UDT as I understand it is that you should be the sort of person who predictably one-boxes in NP. This seems incorrect to me. I think if you are the sort of person who one-boxes in a surprise NP, you will have worse outcomes in general, and that if you have a surprise NP, you should two-box. If you know you will be confronted with NP tomorrow, then sure, you should decide to one-box ahead of time. But I think deciding now to “be the sort of person who would one-box in NP,” (or equivalently, deciding now to commit to a decision theory which will result in that) is a mistake.
Eliezer Yudkowsky and the whole UDT crowd seem to think that you should commit to a decision theory which seems like a bad one to me, on the basis that it would be rational to have precommitted if you end up in this situation. They seem to have convinced most LW people of this. I think they are wrong. I think CDT is a better decision theory which is more intuitive. I agree CDT gives a suboptimal outcome in surprise-NP, but I think any decision theory can give a good or bad outcome in corner-cases, along the lines of “You meet a superintelligent agent which will punish people who use (good decision theory) and reward those who use (bad decision theory).” Thus, NP shouldn’t count as a strike against CDT.
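The argument can be made quantitative with a toy payoff comparison. The dollar amounts follow the usual statement of Newcomb's problem and the anti-Newcomb omega above; the large negative utility standing in for "kill your family" is an arbitrary choice of mine:

```python
# Illustrative payoffs only. Standard Newcomb with a reliable predictor:
# one-boxers get $1M, two-boxers get $1,000. The inverted "anti-Newcomb"
# omega: two-boxer types get $1M, one-boxer types lose their family
# (modeled as -1e8 utility, an assumed number).
STANDARD_NEWCOMB = {"one_boxer": 1_000_000, "two_boxer": 1_000}
ANTI_NEWCOMB = {"one_boxer": -100_000_000, "two_boxer": 1_000_000}

def expected_utility(disposition: str, p_anti: float) -> float:
    """Expected utility of a fixed disposition when the anti-Newcomb omega
    appears with probability p_anti and standard Newcomb otherwise."""
    return ((1 - p_anti) * STANDARD_NEWCOMB[disposition]
            + p_anti * ANTI_NEWCOMB[disposition])

for p in (0.0, 0.01, 0.5):
    print(p, expected_utility("one_boxer", p), expected_utility("two_boxer", p))
```

Which disposition wins depends entirely on which omega you expect to meet, which is the sense in which neither corner-case should count for or against a decision theory.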
The intention was to portray the transparent box as having lots of money—call it $1,000,000.
Well, if you were confronted with Newcomb’s problem, would you one-box or two-box? How fully do you endorse your answer as being “correct” or maximally rational, or anything along those lines?
I’m not trying to argue against anyone who says they aren’t sure, but they think they would one-box or two-box in some hypothetical, or anyone who has thought carefully about the possible existence of unknown unknowns and come down on the “I have no idea what’s optimal, but I’ve predetermined to do X for the sake of predictability” side for either X.
I am arguing against people who think that Newcomb’s problem means causal decision theory is wrong, and that they have a better alternative. I think Newcomb’s provides no (interesting, nontrivial) evidence against CDT.
I think in-story-you believes they will be killed if they make an inconsistent choice (or at least thinks there is a high enough chance of it that they choose consistently).
The point of the post isn’t so much the specific set up, as it is an attempt to argue that Newcomb’s problem doesn’t provide any reason to be against causal decision theory.
Upon reflection, it was probably a mistake for me to write this phrased as a story/problem/thought experiment. I should probably have just written a shorter post titled something like “Newcomb’s problem provides no (interesting, non-trivial) evidence against using causal decision theory.” I had some fun writing this, though, and (mistakenly?) hoped that people would have fun reading it.
I think I disagree somewhat that “PNP references the strategy for NP”. I think many (most?) LW people have decided they are “the type of person who one-boxes in NP”, and believe that says something positive about them in their actual life. This post is an attempt to push back on that.
It seems from your comment that you think of “What I, Vladimir Nesov, would do in a thought experiment” as different from what you would actually do in real life. (eg, when you say “the problem statement is very confusing on this point.”). I think of both as being much more closely tied.
Possibly the confusion comes from the difference between what you-VN-would-actually-do and what you think is correct/optimal/rational behavior? Like, in a thought experiment, you don’t actually try to imagine or predict what real-you would do, you just wonder what optimal behavior/strategy is? In that case, I agree that this is a confusing problem statement.
I think you probably could do that, but you’d be restricting yourself to something that might work marginally worse than whatever would otherwise be found by gradient descent. Also, the more important part of the 768 dimensional vector which actually gets processed is the token embeddings.
If you believe that neural nets store things as directions, one way to think of this is as the neural net reserving 3 dimensions for positional information, and 765 for the semantic content of the tokens. If the actual meaning of the words you read is roughly 250 times as important to your interpretation of a sentence as where they come in a sentence, then this should make sense?
This is kinda a silly way of looking at it—we don’t have any reason (that I’m aware of) to think of these as separable, the interactions probably matter a lot—but might be not-totally-worthless as intuition.
@AdamYedidia This is super cool stuff! Is the magnitude of the token embeddings at all concentrated in or out of the 3 PCA dimensions for the positional embeddings? If it’s concentrated away from them, we are practically using the addition as a direct sum, which is nifty.
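One way to check would be something like the following. The arrays here are random stand-ins for the real positional/token embedding tables (the names `wpe`/`wte` follow GPT-2's checkpoint convention), so the printed fraction is only meaningful when run on actual model weights:

```python
import numpy as np

# Random stand-ins for learned embedding tables; swap in real weights to
# get a meaningful answer. Shapes match GPT-2 small (assumption).
rng = np.random.default_rng(0)
wpe = rng.normal(size=(1024, 768))   # positional embeddings
wte = rng.normal(size=(50257, 768))  # token embeddings

# Top-3 principal directions of the positional embeddings.
wpe_centered = wpe - wpe.mean(axis=0)
_, _, vt = np.linalg.svd(wpe_centered, full_matrices=False)
pos_basis = vt[:3]  # (3, 768), orthonormal rows

# Fraction of total token-embedding squared norm lying in that subspace.
proj = wte @ pos_basis.T  # (50257, 3)
frac_inside = (proj ** 2).sum() / (wte ** 2).sum()
print(f"fraction of token norm in positional subspace: {frac_inside:.4f}")
```

For isotropic random vectors the baseline is about 3/768 ≈ 0.004; a real model scoring well below that would suggest the direct-sum picture.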
(I imagine you know this, but for the sake of future readers) I think the word Torsor is relevant here :) https://math.ucr.edu/home/baez/torsors.html is a nice informal introduction.
Very cool idea!
It looked like several of the text samples were from erotica or something, which...seems like something I don’t want to see without actively opting in—is there an easy way for you to filter those out?
I have actually tried this, not in tug-of-war, but with moving a stuck car (one end affixed to car, one end to a tree or lamppost or something). In that situation, where the objects aren’t actively adjusting to thwart you, it works quite well!
Yep! It might be easier to visualize with a train on tracks—the rope needs to be parallel to the intended direction of movement. Suppose the rope is nearly perfectly taut and tied to something directly in front of the train. Pulling the rope sideways with 100 newtons requires the perpendicular component of the rope’s force to be 100 N, definitionally. But the rope can only exert force along itself, so if it misses being taut by θ radians, it must be exerting a tension F with F·sin(θ) = 100. But if the rope is very close to perfectly taut, then sin(θ) ≈ θ, so F ≈ 100/θ and (in the limit) you’re exerting infinite force along the rope.
This fades pretty quickly as the rope gets away from the 0 angle, so you then need to secure the car so it won’t move back (rocks under tires or something), and re-tighten the rope, and iterate.
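The amplification and how quickly it fades can be made concrete. The 100 N sideways pull is the figure from above; the example angles are arbitrary:

```python
import math

def rope_tension(side_force_n: float, theta_rad: float) -> float:
    """Tension along the rope needed so that its perpendicular component
    equals the sideways pull: F * sin(theta) = side_force."""
    return side_force_n / math.sin(theta_rad)

# A 100 N sideways pull at a few rope angles (radians off perfectly taut).
for theta in (0.5, 0.1, 0.02):
    print(f"theta={theta:.2f} rad -> tension ~ {rope_tension(100, theta):.0f} N")
```

At half a radian you get only about 2x amplification, which is why the tighten-chock-repeat iteration is needed: the huge gains live very close to θ = 0.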
To a decision-theoretic agent, the value of information is always nonnegative
This seems false. If I selectively give you information in an adversarial manner, and you don’t know that I’m picking the information to harm you, I think it’s very clear that the value of the information you gain can be strongly negative.
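A toy simulation of the point (the setup and all the numbers are invented for illustration): an adversary who shows you only true-but-selected coin flips can drive a naive Bayesian guesser below its no-filtering accuracy.

```python
import random

random.seed(0)

def run_trial(adversarial: bool) -> bool:
    """One decision problem: guess which way a hidden coin is biased.
    Returns True if the guess is correct."""
    truth = random.choice(["heads", "tails"])  # coin is biased 0.7 toward truth
    p = 0.7 if truth == "heads" else 0.3
    flips = ["H" if random.random() < p else "T" for _ in range(50)]
    if adversarial:
        # Adversary truthfully reports only 5 flips chosen to mislead.
        misleading = "T" if truth == "heads" else "H"
        shown = [f for f in flips if f == misleading][:5]
    else:
        shown = flips[:5]  # an unfiltered sample of the same size
    # Naive guess, assuming the shown flips are a random sample.
    guess = "heads" if shown.count("H") > shown.count("T") else "tails"
    return guess == truth

acc_random = sum(run_trial(False) for _ in range(2000)) / 2000
acc_adversarial = sum(run_trial(True) for _ in range(2000)) / 2000
print(f"random sample: {acc_random:.2f}, adversarial: {acc_adversarial:.2f}")
```

Every flip shown is real, yet the information's value is strongly negative, exactly because the guesser doesn't model the selection process.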
I agree we are an existence proof for general intelligence. For alignment, what is the less intelligent thing whose goals humanity has remained robustly aligned to?
Relatedly, I’d really like to be able to attach private notes to authors’ names. There are pairs of people on LW with names I find easy to mistake for each other, and being able to look at the author of a post or comment and see a self-note—“This is the user who is really insightful about X” or “Don’t start arguing with this person, it takes forever and goes nowhere” etc.—would be very helpful.
One issue that I think OpenAI didn’t convince me they had dealt with is that saying “neuron activations are well correlated with x” is different from being able to say what specifically a neuron does mechanistically. I think of this similarly to how I think of the limitations of picking max activating examples from a dataset or doing gradient methods to find high activations: finding the argmax of a function doesn’t necessarily tell you much about the function’s...well, functionality.
This seems like it might have a related obstacle. While this method could eg make it easier to find a focus for mechanistic interpretability, I think the bulk of the hard work would still be ahead.
Another option: my father reports he usually memorizes phone numbers based on the geometric pattern they make on a typical keypad.
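The keypad-geometry trick can be sketched as a mapping from digits to grid positions (standard 4-row phone keypad layout); the example number is made up:

```python
# (row, col) positions on a standard phone keypad.
KEYPAD = {
    "1": (0, 0), "2": (0, 1), "3": (0, 2),
    "4": (1, 0), "5": (1, 1), "6": (1, 2),
    "7": (2, 0), "8": (2, 1), "9": (2, 2),
    "0": (3, 1),
}

def keypad_path(number: str) -> list:
    """The sequence of keypad positions a number traces, ignoring
    non-digit characters like dashes."""
    return [KEYPAD[d] for d in number if d in KEYPAD]

# "2580" traces a straight vertical line down the middle column.
print(keypad_path("2580"))
```

Numbers whose paths form simple shapes (lines, L's, boxes) are presumably the easy ones under this scheme.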