If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Nah, it’s about formalizing “you can just think about neurons, you don’t have to simulate individual atoms.” Which raises the question “don’t have to for what purpose?”, and causal closure answers “for literally perfect simulation.”
Causal closure is impossible for essentially every interesting system, including classical computers (my laptop currently has a wiring problem that definitely affects its behavior despite not being the sort of thing anyone would include in an abstract model).
Are there any measures of approximate simulation that you think are useful here? Computer science and nonlinear dynamics probably have some.
I think it’s possible to be better than humans currently are at Minecraft; I can say more if this sounds wrong.
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. What I mean is more like generalization to new activities for humans to do in Minecraft that humans would find fun, which would be a different kind of ‘better at Minecraft.’
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences.
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
Fair enough.
Yes, it seems totally reasonable for bounded reasoners to consider hypotheses (where a hypothesis like ‘the universe is as it would be from the perspective of prisoner #3’ functions like treating prisoner #3 as ‘an instance of me’) that would be counterfactual or even counterlogical for more idealized reasoners.
Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there’s a chain of logic connecting one to the other. Projecting this into anthropics, you might have a certain systematic bias about which hypotheses you can consider, and yet deny that that systematic bias is valid when presented with it abstractly.
This seems like it makes drawing general lessons about what counts as ‘an instance of me’ from the fact that I’m a bounded reasoner pretty fraught.
I think it doesn’t actually work for the repugnant conclusion—the buttons are supposed to just purely be to the good, and not have to deal with tradeoffs.
Once you start having to deal with tradeoffs, then you get into the aesthetics of population ethics—maybe you want each planet in the galaxy to have a vibrant civilization of happy humans, but past that more happy humans just seems a bit gauche—i.e. there is some point past which the raw marginal utility of cramming more humans into the universe is negative. A button promising an existing human extra life might be offered, but these humans are all immortal if they want to be anyhow, and their lives are so good it’s hard to identify one-size-fits-all benefits one could even in theory supply via button without violating any conservation laws.
All of this is a totally reasonable way to want the future of the universe to be arranged, incompatible with the repugnant conclusion. And still compatible with rejecting the person-affecting view, and pressing the offered buttons in our current circumstances.
Suppose there are a hundred copies of you, in different cells. At random, one will be selected—that one is going to be shot tomorrow. A guard notifies that one that they’re going to be shot.
There is a mercy offered, though—there’s a memory-eraser-ray handy. The one who knows they’re going to be shot is given the option to erase their memory of the warning and everything that followed, putting them in the same information state, more or less, as any of the other copies.
“Of course!” they cry. “Erase my memory, and I could be any of them—why, when you shoot someone tomorrow, there’s a 99% chance it won’t even be me!”
Then the next day comes, and they get shot.
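A minimal sketch of the bookkeeping here (my own illustrative Python, not anything from the thought experiment itself): the post-erasure information state assigns about 1% to being shot, but the copy that was actually warned gets shot every time.

```python
import random

N_COPIES = 100
N_TRIALS = 100_000

# Credence available from the erased information state: "I could be any of them."
shot_as_random_copy = 0
# What actually happens to the copy that received the warning.
shot_as_warned_copy = 0

for _ in range(N_TRIALS):
    shot = random.randrange(N_COPIES)         # the guard picks one copy to shoot
    random_copy = random.randrange(N_COPIES)  # a copy sampled uniformly, as the erased state suggests
    shot_as_random_copy += (random_copy == shot)
    shot_as_warned_copy += 1                  # the warned copy is, by construction, the one who gets shot

print(shot_as_random_copy / N_TRIALS)  # ~0.01
print(shot_as_warned_copy / N_TRIALS)  # 1.0
```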
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
I did not believe this until I tried it myself, so yes, very paradox, much confusing.
Seemed well-targeted to me. I do feel like it could have been condensed a bit or had a simpler storyline or something, but it was still a good post.
I think 3 is close to an analogy where the LLM is using ‘sacrifice’ in a different way than we’d endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned not to worry about it—they’re collaborators, not adversaries—and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually, until in the limit of RL all sacrifices are ignored.
Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It’s like if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can’t agree on a criterion to evaluate. You could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
Perhaps I should have said that it’s silly to ask whether “being like A” or “being like B” is the goal of the game.
I have a different concern than most people.
An AI that follows the English text rather than the reward signal in the chess example cannot on that basis be trusted to do good things when given English text plus human approval reward in the real world. This is because following the text in simple, well-defined environments is solving a different problem than following values-laden text in the real world.
The problem that you need to solve for alignment in the real world is how to interpret the world in terms of preferences when even we humans disagree and are internally inconsistent, and how to generalize those preferences to new situations in a way that does a good job satisfying those same human preferences (which include metapreferences). Absent a case for why this is happening, a proposed AI design doesn’t strike me as dealing with alignment.
When interpreting human values goes wrong, the AI’s internal monologue does not have to sound malevolent or deceptive. It’s not thinking “How can I deceive the humans into giving me more power so I can make more paperclips?” It might be thinking “How can I explain to the humans that my proposal is what will help them the most?” Perhaps, if you can find its internal monologue describing its opinion of human preferences in detail, they might sound wrong to you (Wait a minute, I don’t want to be a wirehead god sitting on a lotus throne!), or maybe it’s doing this generalization implicitly, and just understands the output of the paraphraser slightly differently than you would.
Yeah, this makes sense.
You could also imagine more toy-model games with mixed ecological equilibria.
E.g. suppose there’s some game where you can reproduce by getting resources, and you get resources by playing certain strategies, and it turns out there’s an equilibrium where there’s 90% strategy A in the ecosystem (by some arbitrary accounting) and 10% strategy B. It’s kind of silly to ask whether it’s A or B that’s winning based on this.
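As a concrete version of this toy model (my own illustrative numbers, using standard replicator dynamics on a hawk-dove-style payoff matrix), here is a sketch where the population settles at 90% A and 10% B:

```python
# Replicator dynamics for a hawk-dove-style game chosen so the mixed
# equilibrium is 90% strategy A ("hawk") and 10% strategy B ("dove").
# Payoffs: V = 9 (value of the resource), C = 10 (cost of a fight),
# so the equilibrium hawk fraction is V/C = 0.9.
V, C = 9.0, 10.0
payoff = {
    ("A", "A"): (V - C) / 2,  # two hawks split the value but pay the cost
    ("A", "B"): V,            # hawk takes everything from a dove
    ("B", "A"): 0.0,          # dove concedes
    ("B", "B"): V / 2,        # two doves share peacefully
}

x = 0.5   # initial fraction of strategy A in the population
dt = 0.01
for _ in range(100_000):
    f_A = x * payoff[("A", "A")] + (1 - x) * payoff[("A", "B")]
    f_B = x * payoff[("B", "A")] + (1 - x) * payoff[("B", "B")]
    f_avg = x * f_A + (1 - x) * f_B
    x += dt * x * (f_A - f_avg)  # standard replicator update

print(round(x, 3))  # ~0.9
```

At the fixed point A and B earn equal average payoffs, which is why asking which of them is ‘winning’ there comes out silly.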
Although now that I’ve put things like that, it does seem fair to say that A is ‘winning’ if we’re not at equilibrium and A’s total resources (by some accounting...) are increasing over time.
Now to complicate things again, what if A is increasing in resource usage but simultaneously mutating to be played by fewer actual individuals (the trees versus pelagibacter, perhaps)? Well, in the toy model setting it’s pretty tempting to say the question is wrong, because if the strategy is changing it’s not A anymore at all, and A has been totally wiped out by the new strategy A’.
Actually I guess I endorse this response in the real world too, where if a species is materially changing to exploit a new niche, it seems wrong to say “oh, that old species that’s totally dead now sure were winners.” If the old species had particular genes with a satisfying story for making it more adaptable than its competitors, perhaps better to take a gene’s-eye view and say those genes won. If not, just call it all a wash.
Anyhow, on humans: I think we’re ‘winners’ just in the sense that the human strategy has paid off far better than our population 200ky ago would have suggested, leading to a boom in population and resource use. As you say, we don’t need to be comparing ourselves to phytoplankton; the game is nonzero-sum.
The bet is indeed on. See you back here in 2029 :)
Sadly for your friend, the hottest objects in the known universe are still astronomical rather than manmade. The LHC runs on the scale of 10 TeV (10^13 eV). The Auger observatory studies particles that start at 10^18 eV and go up from there.
Ok, I’m agreeing in principle to make the same bet as with RatsWrongAboutUAP.
(“I commit to paying up if I agree there’s a >0.4 probability something non-mundane happened in a UFO/UAP case, or if there’s overwhelming consensus to that effect and my probability is >0.1.”)
I think you can do some steelmanning of the anti-flippers with something like Lara Buchak’s arguments on risk and rationality. Then you’d be replacing the vague “the utility maximizing policy seems bad” argument with a more concrete “I want to do population ethics over the multiverse” argument.
I did a podcast with Undark a month or two ago: a discussion with Arvind Narayanan from AI Snake Oil. https://undark.org/2024/11/11/podcast-will-artificial-intelligence-kill-us-all/
Well, that went quite well. Um, I think two main differences I’d like to see are, first, a shift in attention from ‘AGI when’ to more specific benchmarks/capabilities. Like, ability to replace 90% of the work of an AI researcher (can you say SWE-bench saturated? Maybe in conversation with Arvind only) when?
And then the second is to try to explicitly connect those benchmarks/capabilities directly to danger—like, make the ol’ King Midas analogy maybe? Or maybe just that high capabilities → instability and risk inherently?
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
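To make the question concrete, here’s a minimal REINFORCE sketch on a two-armed bandit (my own illustrative code, with arbitrary payoff probabilities and learning rate): the reward shows up only as a multiplier on how big a log-probability step gets taken for that particular sampled datapoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Softmax policy over two arms; arm 1 pays off more often (numbers are arbitrary).
logits = np.zeros(2)
base_lr = 0.1
payoff_prob = (0.3, 0.8)

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(2, p=probs)
    r = float(rng.random() < payoff_prob[action])  # reward is 0 or 1

    # REINFORCE: the gradient of log pi(action) w.r.t. the logits is (one_hot - probs).
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # The reward enters only as a per-datapoint multiplier on the step size.
    logits += (base_lr * r) * grad_log_pi

print(np.round(np.exp(logits) / np.exp(logits).sum(), 2))  # policy shifts toward arm 1
```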
For readers familiar with Markov chain Monte Carlo, you can probably fill in the blanks now that I’ve primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don’t want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it’s going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that’s very similar is to just take smaller steps when going uphill than when going downhill.
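Here’s an illustrative side-by-side of those two variants on a made-up 1-D double-well energy (my own sketch, not a claim about any particular training setup): the first is the standard Metropolis accept/reject rule, the second always moves but shrinks uphill steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # A 1-D double well: shallow minimum near x = -1, deeper minimum near x = +1.
    return x**4 - 2 * x**2 - 0.5 * x

def metropolis(steps=20_000, temp=0.3):
    x = -1.0
    for _ in range(steps):
        prop = x + rng.normal(scale=0.2)
        dE = energy(prop) - energy(x)
        # Always accept downhill; accept uphill with probability exp(-dE/temp).
        if dE <= 0 or rng.random() < np.exp(-dE / temp):
            x = prop
    return x

def shrink_uphill(steps=20_000, temp=0.3):
    x = -1.0
    for _ in range(steps):
        direction = rng.normal(scale=0.2)
        dE = energy(x + direction) - energy(x)
        # Always move, but take smaller steps when the move is uphill.
        scale = 1.0 if dE <= 0 else np.exp(-dE / temp)
        x += scale * direction
    return x

print(round(metropolis(), 2), round(shrink_uphill(), 2))
```

With these particular numbers, both tend to end up near the deeper minimum, which is the sense in which modulating step size can play a role similar to the accept/reject probability.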
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you’re supposed to take based on stochastically sampled directions, then an interesting question is: when can you infer an energy function that’s implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there’s no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what’s going on as a markov chain monte carlo procedure plus some extra stuff?
I don’t know, these seem like interesting questions in learning theory. If you search for questions like “under what conditions does the REINFORCE algorithm find a global optimum,” you find papers like this one that don’t talk about MCMC, so maybe I’ve lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape you’re doing a Markov chain Monte Carlo walk around.
Euan seems to be using the phrase to mean (something like) causal closure (as the phrase would normally be used, e.g. in talking about physicalism) of the upper level of description—basically saying that everything that actually happens makes sense in terms of the emergent theory; it doesn’t need interventions from outside or below.