Even if you manage to truly forget about the disease, there must exist a mind "somewhere in the universe" that is exactly the same as yours except without knowledge of the disease. This seems quite unlikely to me, because by the time you decide to erase the memory, having the disease has already interacted causally with the rest of your mind a lot. What you'd really need to do is undo all the consequences of those interactions, which seems a lot harder. You'd really need to transform your mind into another one that you somehow know is present "somewhere in the multiverse", which also seems really hard to know.
I deliberately left out a key qualification in that (slightly edited) statement, because I couldn’t explain it until today.
I might be missing something crucial because I don’t understand why this addition is necessary. Why do we have to specify “simple” boundaries on top of saying that we have to draw them around concentrations of unusually high probability density? Like, aren’t probability densities in Thingspace already naturally shaped in such a way that if you draw a boundary around them, it’s automatically simple? I don’t see how you run the risk of drawing weird, noncontiguous boundaries if you just follow the probability densities.
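One way to poke at this question empirically is a quick simulation (this is my own sketch, not something from the post; it uses scipy's `gaussian_kde` with an intentionally narrow bandwidth, and the sample size and threshold are arbitrary choices). The point it gestures at: with only finitely many observations, the region where the estimated density is high can come out noncontiguous even when the true density is one simple blob, which might be one reason simplicity gets stated as a separate requirement.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy check: draw samples from a single, maximally "simple" cluster and see whether
# thresholding an estimated density automatically gives back one simple region.
rng = np.random.default_rng(0)
samples = rng.normal(size=200)               # one unimodal cluster in 1-D "Thingspace"

kde = gaussian_kde(samples, bw_method=0.05)  # deliberately narrow bandwidth
grid = np.linspace(-4, 4, 4000)
high_density = kde(grid) > 0.03              # "draw the boundary around high density"

# Count contiguous runs of above-threshold grid points.
regions = np.flatnonzero(np.diff(high_density.astype(int)) == 1).size + int(high_density[0])
print("disjoint high-density regions:", regions)
# With a narrow bandwidth this is often > 1: the raw empirical density suggests a
# noncontiguous boundary even though the true density is a single simple blob.
```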
One way in which “spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time” could be solved automatically is just by having a truly huge context window. Example of an experiment: teach a particular branch of math to an LLM that has never seen that branch of math.
Maybe humans just have the equivalent of a sort of huge context window spanning selected stuff from their entire lifetime, and so this kind of learning is possible for them.
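If one wanted to actually run something like that experiment, a rough sketch might look like the following (this assumes the `openai` Python client; the model name, the file name, and the idea of doing the "teaching" via one long prompt are my own placeholder choices, and a real version would need material the model verifiably has never seen):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder file: a long document teaching the novel branch of math, e.g. freshly
# invented definitions, lemmas, and worked examples the model has never been trained on.
with open("novel_math_notes.txt") as f:
    teaching_material = f.read()

# Placeholder test problem that can only be solved using the material above.
test_problem = "Using only the definitions above, prove or disprove: ..."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any sufficiently long-context model would do
    messages=[
        {"role": "system", "content": "Learn the following material and use it to answer."},
        {"role": "user", "content": teaching_material + "\n\n" + test_problem},
    ],
)
print(response.choices[0].message.content)
```

The interesting comparison would be how performance scales as more of the "course" fits into a single context, versus having to learn it across separate conversations.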
You mention eight cities here. Do they count for the bet?
The Waluigi effect also seems bad for s-risk. "Optimize for pleasure, …" → "Optimize for suffering, …".
If LLM simulacra resemble humans but are misaligned, that doesn't bode well on the s-risk front.
An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore S-risk could be large.
We should implement Paul Christiano's debate game with alignment researchers instead of ML systems.
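Taken literally, the mechanics are easy to script. A minimal sketch (my own construction: the number of rounds, the prompts, and the single judge verdict are placeholder choices, not details from the original debate proposal):

```python
# Minimal sketch of a debate round between two human alignment researchers:
# they alternate short arguments for opposing positions, then a judge rules.
def run_debate(question: str, rounds: int = 3) -> None:
    print(f"Question under debate: {question}")
    transcript = []
    for r in range(1, rounds + 1):
        for debater in ("Debater A", "Debater B"):
            argument = input(f"[Round {r}] {debater}, your argument: ")
            transcript.append((debater, argument))
    print("\nFull transcript:")
    for debater, argument in transcript:
        print(f"{debater}: {argument}")
    verdict = input("\nJudge: who argued more convincingly (A/B)? ")
    print(f"Judge rules for: {verdict}")

if __name__ == "__main__":
    run_debate("Is iterated amplification likely to preserve alignment at scale?")
```

The script is the easy part; the value would come from committing to the format and to a judge whose verdict everyone takes seriously.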
This community has developed a bunch of good tools for helping resolve disagreements, such as double-cruxing. It's a waste that they weren't systematically deployed for the MIRI conversations. Those conversations could have been more productive, and we could've walked away with a succinct and precise understanding of where the disagreements are and why.
Another thing one might wonder is whether performing iterated amplification with constant input from an aligned human (the "H" in the original iterated amplification paper) would result in a powerful aligned thing, provided that thing remains corrigible throughout the training process.
The comment about tool-AI vs agent-AI is just ignorant (or incredibly dismissive) of mesa-optimizers, and of the fact that being asked to predict what an agent would do immediately instantiates such an agent inside the tool-AI. It's obvious that a tool-AI is safer than an explicitly agentic one, but not for arbitrary levels of intelligence.
This seems way too confident to me given the level of generality of your statement. And to be clear, my view is that this could easily happen in LLMs based on transformers, but what about other architectures? If you just talk about how a generic "tool-AI" would or would not behave, it seems to me that you're operating at a level of abstraction far too high to make such specific statements with confidence.
If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that’s less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
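A standard toy illustration of that asymmetry (my own example, not from the post): finding a subset of numbers that sums to a target requires search, while checking a proposed subset is a one-liner.

```python
from itertools import combinations

def find_subset(numbers, target):
    """Constructing / proving / producing: brute-force search over subsets."""
    for r in range(1, len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == target:
                return combo
    return None

def check_subset(subset, numbers, target):
    """Recognizing / checking / criticizing: a single cheap pass."""
    return sum(subset) == target and all(x in numbers for x in subset)

numbers = [3, 34, 4, 12, 5, 2]
print(find_subset(numbers, 9))           # slow in general: up to 2^n subsets to try
print(check_subset((4, 5), numbers, 9))  # fast: verify the certificate directly
```

The hope in the parent comment is analogous: recognizing human values inside a model (given good interpretability tools) might be easier than constructing a loss function that specifies them.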
Why shouldn't this work? What's the epistemic failure mode being pointed at here?
While you can "cry wolf" in ways that are maybe useful, you can also state your detailed understanding of each specific situation as it arises and how it specifically plays into the broader AI-risk context.
As impressive as ChatGPT is on some axes, you shouldn’t rely too hard on it for certain things because it’s bad at what I’m going to call “board vision” (a term I’m borrowing from chess).
How confident are you that you cannot find some agent within ChatGPT with excellent board vision through more clever prompting than what you’ve experimented with?
As a failure mode of specification gaming, agents might modify their own goals.
As a convergent instrumental goal, agents want to prevent their goals from being modified.
I think I know how to resolve this apparent contradiction, but I’d like to see other people’s opinions about it.
I'm going to re-ask all of my questions that I don't think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
I am trying to figure out what the relation is between "alignment with evolution" and "short-term thinking". Like, imagine that some people get hit by magical space rays, which make them fully "aligned with evolution". What exactly would such people do?
I think they would become consequentialists smart enough that they could actually act to maximize inclusive genetic fitness. I think Thou Art Godshatter is convincing.
But what if the art or the philosophy makes it easier to get laid? So maybe in that case they would do the art/philosophy, but they would feel no intrinsic pleasure from doing it; it would all be purely instrumental, and they'd be willing to throw it all away if on second thought they found out that it wasn't actually maximizing reproduction?
Yeah that’s what I would expect.
How would they even figure out what is the reproduction-optimal thing to do? Would they spend some time trying to figure out the world? (The time that could otherwise be spent trying to get laid?) Or perhaps, as a result of sufficiently long evolution, they would already do the optimal thing instinctively? (Because those who had the right instincts and followed them, outcompeted those who spent too much time thinking?)
I doubt that being governed by instincts can outperform a sufficiently smart agent reasoning from scratch, given a sufficiently complicated environment. Instincts are just heuristics, after all...
But would that mean that the environment is fixed? Especially if the most important part of the environment is other people? Maybe humanity would get locked in an equilibrium where the optimal strategy is found, and everyone who tries doing something else is outcompeted; and afterwards those who do the optimal strategy more instinctively outcompete those who need to figure it out. What would such an equilibrium look like?
Ohhh interesting, I have no idea… it seems plausible that it could happen though!
No, I mean “humans continue to evolve genetically, and they never start self-modifying in a way that makes evolution impossible (e.g., by becoming emulations).”
From a purely utilitarian standpoint, I’m inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future.
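As a toy version of that comparison (every number below is a placeholder I made up purely to show the shape of the argument, not an estimate):

```python
# Back-of-envelope structure of the "cost of delay vs. value of a better outcome" comparison.
lives_lost_per_year_of_delay = 6e7       # placeholder: order of current annual global deaths
years_of_delay = 10
delay_cost = lives_lost_per_year_of_delay * years_of_delay                 # ~6e8 lives

potential_future_lives = 1e16            # placeholder for an astronomically large future
gain_in_good_outcome_probability = 1e-3  # placeholder: delay shifts the odds only slightly
expected_gain = potential_future_lives * gain_in_good_outcome_probability  # ~1e13 lives

print(f"cost of delay:  {delay_cost:.1e} lives")
print(f"expected gain:  {expected_gain:.1e} lives")
# Under these (made-up) assumptions the expected gain dwarfs the cost, which is the
# intuition behind "the cost of delaying is dwarfed by the number of future lives saved".
```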
That said, after we know there’s “no chance” of extinction risk, I don’t think delaying would likely yield better future outcomes. On the contrary, I suspect getting the coordination necessary to delay means it’s likely that we’re giving up freedoms in a way that may reduce the value of the median future and increase the chance of stuff like totalitarian lock-in, which decreases the value of the average future overall.
I think you're correct that the "other existential risks exist" consideration also has to be balanced in the calculation, although I don't expect it to be clear-cut.