Not sure what your meetup content is, or what you feel the real criteria for someone fitting in are. Are you going to talk about science, or technology, or philosophy, or are you going to do some kind of exercise or group activity, or are you just going to hang out?
For meetups I’ve run in the past, I think the most important criterion of fit was that someone should enjoy training their cognitive skills (which was usually the meat of the meetups); enjoyment of LW subculture (“did you see X?” being a good way to have a fun conversation / hang out) was an important secondary quality.
I strongly agree, but I think the format of the thing we get, and how to apply it, are still going to require more thought.
Human values as they exist inside humans are going to exist natively as several different, perhaps conflicting, ways of judging human internal ways of representing the world. So first you have to make a model of a human, and figure out how you’re going to locate intentional-stance elements like “representation of the world.” Then you run into ontological crises from moving the human’s models and judgments into some common, more accurate model (that an AI might use). Get the wrong answer in one of these ontological crises, and the modeled utility function may assign high value to something we would regard as deceptive, or as wireheading the human (such reactions might give some hints towards how we want to resolve such ontological crises).
Once we’re comparing human judgments on a level playing field, we can still run into problems of conflicts, problems of circularity, and other weird meta-level conflicts (like not valuing some of our values) that I’m not sure how to address in a principled way. But suppose we compress these judgments into one utility function within the larger model. Are we then done? I’m not sure.
I’m not sure that the agent that constantly twitches is going to be motivated by coherence theorems anyways. Is the class of agents that care about coherence identical to the class of potentially dangerous goal-directed/explicit-utility-maximizing/insert-euphemism-here agents?
When thinking about agents, the first motivation might not quite work out. Small changes in observation might introduce discontinuous changes in policy—e.g. in the Matching Pennies game. Suppose there are agents (functions) in X that output a fixed P(Heads), no matter their input. If you can continuously vary P(Heads) by moving in X, then Matching Pennies play will be discontinuous at P(Heads)=0.5. So right away you’ve committed to some unusual behavior for the agents in X by asking for continuity—they can’t play perfect Matching Pennies, at the very least.
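To make that concrete, here's a minimal sketch (my own illustration, using the standard matcher-wins-on-a-match payoffs) of the best response against an opponent who plays Heads with some fixed probability p:

```python
# A minimal sketch of the discontinuity: the matcher's best response in Matching
# Pennies jumps at p = 0.5, even though the opponent's strategy varies continuously.
def best_response_p_heads(opponent_p_heads):
    """Our best-response probability of playing Heads, as the matcher."""
    if opponent_p_heads > 0.5:
        return 1.0    # opponent leans Heads, so always play Heads
    if opponent_p_heads < 0.5:
        return 0.0    # opponent leans Tails, so always play Tails
    return 0.5        # only at exactly 0.5 is any mixture optimal

for p in (0.3, 0.49, 0.499, 0.5, 0.501, 0.51, 0.7):
    print(p, best_response_p_heads(p))
# 0.499 -> 0.0 but 0.501 -> 1.0: an arbitrarily small change in the opponent's
# P(Heads) forces a maximal change in the best response.
```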
Because the noise usually grows as the signal does. Consider Moore’s law for transistors per chip. Back when that number was about 10^4, the standard deviation was also small—say 10^3. Now that the count is around 10^8, no two chips are going to be within a thousand transistors of each other; the standard deviation is much bigger (~10^7).
This means that if you’re trying to fit the curve, being off by 10^5 is a small mistake when predicting the current transistor count, but a huge mistake when predicting past counts. It’s not rare or implausible now to find a chip with 10^5 more transistors, but back in the ’70s that difference would have been a huge error, impossible under an accurate model of reality.
A basic fitting function, like least squares, doesn’t take this into account. It will trade off transistors now vs. transistors in the past as if the mistakes were of exactly equal importance. To do better you have to use something like a chi squared method, where you explicitly weight the points differently based on their variance. Or fit on a log scale using the simple method, which effectively assumes that the noise is proportional to the signal.
When trying to fit an exponential curve, don’t weight all the points equally. Or if you’re using Excel and just want the easy way, take the log of your values and then fit a straight line to the logs.
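For instance, here's a minimal sketch of the log trick on made-up Moore's-law-like data (the doubling time and the ~20% multiplicative noise are just illustrative assumptions):

```python
# A minimal sketch: fit a line to log(counts) rather than fitting raw counts directly.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1971, 2021)
true_counts = 1e4 * 2.0 ** ((years - 1971) / 2)                        # doubling every 2 years
observed = true_counts * (1 + 0.2 * rng.standard_normal(years.size))   # noise grows with the signal

# Fitting a straight line to log(counts) implicitly treats the noise as
# proportional to the signal, so early and late points get comparable weight.
slope, intercept = np.polyfit(years, np.log(observed), deg=1)
print(f"estimated doubling time: {np.log(2) / slope:.2f} years")        # recovers ~2 years

# The more explicit alternative is a chi-squared-style weighted fit, e.g.
# scipy.optimize.curve_fit(model, years, observed, sigma=0.2 * observed),
# where each point is down-weighted by an estimate of its own noise level.
```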
Ah, it started so well. And then the numbered list started, and you didn’t use any of the things from before the list at all! You assumed some new things (1, 2 and 3) that contained your entire conclusion.
Let me try to redirect you just a little.
Suppose we flip a coin and hide it under a cup without looking at it. We should bet as if the coin has P(Heads)=0.5, because when we are ignorant we can’t do better than assigning a probability, even though the reality is fixed. In fact, the same argument applies before flipping the coin if we ignore quantum effects—the universe is already arranged such that the coin will land heads or tails, but because we don’t know which, we assign a probability.
Now suppose that you get to look at the coin, while I don’t. Now you should assign P(Heads)=1 if it is heads, and P(Heads)=0 if it is tails, but I should still assign P(Heads)=0.5. Different people can assign different probabilities, and that’s okay.
The Sleeping Beauty problem has two perspectives—Sleeping Beauty’s view, and the experimenter’s view (or god’s view). In these two views, you face different constraints. To Sleeping Beauty, she is special and she knows that certain logical relationships hold between the allowed day and the state of the coin. To the experimenter, the coin and the day are independent variables, and no instance of Sleeping Beauty is special.
(note: if you think the day being Monday is an “invalid” observable, just suppose that there is a calendar outside the room and Sleeping Beauty is predicting what she will see when she checks the calendar, much like how we predicted what we would see when we looked at the flipped coin.)
Everyone thinks that assigning probabilities from the experimenter’s view is easy, but they disagree about Sleeping Beauty’s view.
Here’s a trick that tells you about what betting odds Sleeping Beauty should assign, using only the easy experimenter’s view! Just suppose that the experimenter is betting money against Sleeping Beauty—every time Sleeping Beauty wakes up she makes this bet. Every dollar won by Sleeping Beauty is lost by the experimenter. What is a fair price for Sleeping Beauty to pay, in exchange for the experimenter paying her $1.00 if the day is Monday?
We don’t need to use Sleeping Beauty’s view to answer this question. We just use the fact that the experimenter’s view is easy, and the bet is fair if the experimenter doesn’t gain or lose any money on average, from the experimenter’s view. With probability 0.5 (for the experimenter) Sleeping Beauty only wakes up on Monday, and with probability 0.5 (for the experimenter) she wakes up on both Monday and Tuesday and makes the bet both times. So with probability 0.5 the experimenter pays a dollar and gets the fair price, and with probability 0.5 the experimenter pays a dollar and gets twice the fair price.
In other words, 3 times the fair price = 2 dollars. The fair price for a bet that pays Sleeping Beauty on Monday is $2/3.
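If you'd rather not trust the algebra, here's a minimal Monte Carlo sketch of the same bet, still using only the experimenter's view; the payoff structure is exactly the one described above, and nothing else is assumed:

```python
# Average experimenter profit when Beauty pays `price` per awakening for a bet
# that pays her $1 if the day is Monday.
import random

def experimenter_profit(price, n_trials=100_000):
    total = 0.0
    for _ in range(n_trials):
        if random.random() < 0.5:      # Heads: Beauty wakes only on Monday
            total += price - 1.0       # she pays the price; it's Monday, so she wins $1
        else:                          # Tails: she wakes (and bets) on Monday and Tuesday
            total += price - 1.0       # Monday: pays the price, wins $1
            total += price             # Tuesday: pays the price, wins nothing
    return total / n_trials

print(experimenter_profit(2 / 3))      # ~0.00: the experimenter breaks even at $2/3
print(experimenter_profit(0.50))       # ~-0.25: at $0.50 the experimenter loses on average
```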
Looks pretty interesting. I’m not super sold on this being a “nice” business model, since playing constructed at a competitive level still seems like a multi-hundred-dollar buy-in that’s only going to increase with further expansions. But I like drafting anyhow, so sure.
I’m also a little concerned about some of the big power differences in heroes, and certain instances of early-game RNG—a little is necessary but I think things can get unfun when there’s a high enough variance and clear enough options that you can tell that you’ve probably already won or lost, but have to play it out anyhow.
Still, I’ll probably get it—I’m more or less done with Slay the Spire (if you like card-based combat, puzzly roguelikes, good balance, and high difficulty, I definitely recommend that game, but at this point I’ve beaten A20 with all the dudes, and don’t feel like going for high winrate), and the gameplay videos seem interesting.
Anyone can PM me if they want to talk Artifact, I guess?
Have you read the Blue-Minimizing Robot? Early Homo sapiens lived in a simple environment where it seemed like they were “minimizing blue,” i.e. maximizing genetic fitness. Now, you might say, it seems like our behavior indicates preferences for happiness, meaning, validation, etc., but really that’s just an epiphenomenon, no more meaningful than our previous apparent preference for genetic fitness.
However, there is an important difference between us and the blue-minimizing robot, which is that we have a much better model of the world, and within that model of the world we do a much better job than the robot at making plans. What kind of plans? The thing that motivates our plans is, from a purely functional perspective, our preferences. And this thing isn’t all that different in modern humans versus hunter-gatherers. We know, we’ve talked to them. There have been some alterations due to biology and culture, but not as much as there could have been. Hunter-gatherers still like happiness, meaning, validation, etc.
What seems to have happened is that evolution stumbled upon a set of instincts that produced human planning, and that in the ancestral environment this correlated well with genetic fitness, but in the modern environment this diverges even though the planning process itself hasn’t changed all that much. There are certain futuristic scenarios that could seriously disrupt the picture of human values I’ve given, but I don’t think it’s the default, particularly if there aren’t any optimization processes much stronger than humans running around.
Hm. I wonder what an “alternative” to neural nets and gradient descent would look like. Neural nets are really just there as a highly expressive model class that gradient descent works on.
One big difficulty is that if your model is going to classify pictures of cats (or go boards, etc.), it’s going to be pretty darn complicated, and I’m sceptical that any choice of model class is going to prevent that. But maybe one could try to “hide” this complexity in a recursive structure. Neural nets already do this, but convnets especially mix up spatial hierarchy with logical hierarchy, and NNs in general aren’t as nicely packaged into human-thought-sized pieces as maybe they could be—consider resnets, which work well precisely because they abandon the pretense of each neuron being some specific human-scale logical unit.
So maybe you could go the opposite direction and make that pretense a reality with some kind of model class that tries to enforce “human-thought-sized” reused units with relatively sparse inter-unit connections? Could still train with SGD, or treat hypotheses as decision trees and take advantage of that literature.
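To be a bit more concrete, here's a rough sketch of the kind of thing I mean; all the names, sizes, and the toy image-shaped input are made up for illustration, and this is just one way to get "one reused small unit plus sparse wiring" that still trains with plain SGD:

```python
# A rough sketch, not a worked-out proposal: a shared "human-thought-sized" unit
# applied at every node, with a fixed sparse wiring mask between nodes.
import torch
import torch.nn as nn

class SmallUnit(nn.Module):
    """A tiny MLP meant to stand in for one 'human-thought-sized' piece."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class SparseModularModel(nn.Module):
    """Many nodes that all share one SmallUnit, connected by sparse fixed wiring."""
    def __init__(self, n_units=8, dim=16, keep_prob=0.2, n_rounds=3):
        super().__init__()
        self.unit = SmallUnit(dim)                       # the same unit is reused everywhere
        self.embed = nn.Linear(28 * 28, n_units * dim)   # toy input: a flattened 28x28 image
        self.readout = nn.Linear(n_units * dim, 10)
        # sparse adjacency between units, fixed at initialization
        self.register_buffer("mask", (torch.rand(n_units, n_units) < keep_prob).float())
        self.n_units, self.dim, self.n_rounds = n_units, dim, n_rounds

    def forward(self, x):
        h = self.embed(x.flatten(1)).view(-1, self.n_units, self.dim)
        for _ in range(self.n_rounds):                   # a few rounds of message passing
            msgs = torch.einsum("uv,bvd->bud", self.mask, h)   # only sparse connections carry info
            h = h + self.unit(msgs)                      # the shared small unit at every node
        return self.readout(h.flatten(1))

model = SparseModularModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)        # plain SGD, as in the comment
logits = model(torch.randn(4, 1, 28, 28))                # a batch of 4 toy "images"
```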
But suppose we got such a model class working, and trained it to recognize cats. Would it actually be human-comprehensible? Probably not! I guess I’m just not really clear on what “designed for transparency and alignability” is supposed to cash out to at this stage of the game.
I think Sean Carroll does a pretty good job, e.g. in Free Will Is As Real As Baseball.
Interesting! I’m still concerned that, since you need to aggregate these things in the end anyhow (because everything is commensurable in the metric of affecting decisions), the aggregation function is going to be allowed to be very complicated and dependent on factors that don’t respect the separation of this trichotomy.
But it does make me consider how one might try to import this into value learning. I don’t think it would work to take these categories as given and then try to learn meta-preferences to sew them together, but most (particularly more direct) value learning schemes have to start with some “seed” of examples. If we draw that seed only from “approving,” does that mean that the trained AI isn’t going to value wanting or liking enough? Or would everything probably be fine, because we wouldn’t approve of bad stuff?
#8 actually comes up in physics, in the field of nonlinear dynamics (pretty picture, actual wikipedia). The fact that continuous changes in functions can lead to surprising changes in fixed points (specifically stable attractors) is pretty darn important to understanding e.g. phase transitions!
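As a toy illustration (my example, not from the exercise): in the logistic map x → r·x·(1−x), smoothly increasing r changes the stable attractor abruptly near r = 3, where the fixed point loses stability to a 2-cycle:

```python
# Iterate the map past its transient, then record where the orbit settles.
def attractor(r, x=0.2, burn_in=5000, samples=200):
    for _ in range(burn_in):           # let transients die out
        x = r * x * (1 - x)
    points = set()
    for _ in range(samples):           # record the values the orbit keeps visiting
        x = r * x * (1 - x)
        points.add(round(x, 4))
    return sorted(points)

for r in (2.8, 2.9, 3.1, 3.2, 3.5):
    print(r, attractor(r))
# Below r = 3 the orbit settles onto a single fixed point; just above, it alternates
# between two values (and later four, eight, ...), even though the map itself varies
# continuously in r.
```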
Does this work for #7? (and question) (Spoilers for #6):
I did #6 using 2D Sperner’s lemma and closedness. Imagine the destination points are colored [as in #5, which was a nice hint] by where they are relative to their source points—split the possible difference vectors into a colored circle as in #5 [pick the center to be a fourth color so you can notice if you ever sample a fixed point directly, but if fixed points are rare this shouldn’t matter], and take samples to make it look like 2D Sperner’s lemma, in which there must be at least one interior tri-colored patch. Define a limit of zooming in that moves you towards the tri-colored patch, and apply closedness to say the center (fixed) point is included, much like how we were encouraged to do #2 with 1D Sperner’s lemma.
To do #7, it seems like you just need to show that there’s a continuous bijection that preserves whether a point is interior or on the edge, from any convex compact subset of R^2 to any other. And there is indeed a recipe to do this—it’s like you imagine sweeping a line across the two shapes, at rates such that they finish in equal time. Apply a 1D transformation (affine will do) at each point in time to make the two cross sections match up and there you are. This uses the property of convexity, even though it seems like you should be able to strengthen this theorem to work for simply connected compact subsets (if not—why not?).
EDIT: (It turns out that I think you can construct pathological shapes with uncountable numbers of edges for which a simple linear sweep fails no matter the angle, because you’re not allowed to sweep over an edge of one shape while sweeping over a vertex of the other. But if we allow the angle to vary slightly with parametric ‘time’, I don’t think there’s any possible counterexample, because you can always find a way to start and end at a vertex.)
Then once you’ve mapped your subset to a triangle, you use #6. But.
This doesn’t use the hint! And the hints have been so good and educational everywhere I’ve used them. So what am I missing about the hint?
As a physicist, this is my favorite one for obvious reasons :)
Yeah, I did the same thing :)
Putting it right after #2 was highly suggestive—I wonder if this means there’s some very different route I would have thought of instead, absent the framing.
Shrug I dunno man, that seems hard :) I just tend to evaluate community norms by how well they’ve worked elsewhere, and gut feeling. But neither of these is any sort of diamond-hard proof.
Your question at the end is pretty general, and I would say that most chakra-theorists would not want to join this community, so in a sense we’re already mostly avoiding chakra-theorists—and there are other groups who are completely unrepresented. But I think the mechanism is relatively indirect, and that’s good.
Consider something like protecting the free speech of people you strongly disagree with. It can be an empirical fact (according to one’s model of reality) that if just those people were censored, the discussion would in fact improve. But such pointlike censorship is usually not an option that you actually have available to you—you are going to have unavoidable impacts on community norms and other peoples’ behavior. And so most people around here protect something like a principle of freedom of speech.
If costs are unavoidable, then, isn’t that just the normal state of things? You’re thinking of “harm” as relative to some counterfactual state of non-harm—but there are many counterfactual states an online discussion group could be in that would be very good, and I don’t worry too much about how we’re being “harmed” by not being in those states, except when I think I see a way to get there from here.
In short, I don’t think I associate the same kind of negative emotion with these kinds of tradeoffs that you do. They’re just a fairly ordinary part of following a strategy that gets good results.
I like to make the distinction between thinking the chakra-theorists are valuable members of the community, and thinking that it’s important to have community norms that include the chakra-theorists.
It’s a lot like the distinction between morality and law. The chakra theorists are probably wrong and in fact it probably harms the community that they’re here. But it’s not a good way to run a community to kick them out, so we shouldn’t, and in fact we should be as welcoming to them as we think we should be to similar groups that might have similar prima facie silliness.
So, to sum up (?):
We want the AI to take the “right” action. In the IRL framework, we think of getting there by a series of ~4 steps: (observations of human behavior) → (inferred human decisions in a model) → (inferred human values) → (right action).
Going from step 1 to 2 is hard, and ditto with 2 to 3, and we’ll probably learn new reasons why 3 to 4 is hard when we try to do it more realistically. You mostly use model mis-specification to illustrate this—because very different models at step 2 can predict similar step 1, the inference is hard in a certain way. Because very different models at step 3 can predict similar step 2, that inference is also hard.