Wei Dai
Did SBF or Mao Zedong not have a pointer to the right values, or did they have the right pointer but make mistakes due to computational issues (i.e., would they have avoided causing the disasters that they did if they were smarter and/or had more time to think)? Both seem possible to me, so I’d like to understand how the QACI approach would solve (or rule out) both of these potential problems:
1. If many humans don’t have pointers to right values, how to make sure QACI gets a pointer from humans who have a pointer to the right values?
2. How to make sure that AI will not make some catastrophic mistake while it’s not smart enough to fully understand the values we give it, while still being confident enough in its guesses of what to do in the short term to do useful things?
Moral uncertainty is an area in philosophy with ongoing research, and assuming that AI will handle it correctly by default seems unsafe, similar to assuming that AI will have the right decision theory by default.
I see that Tamsin Leake also pointed out 2 above as a potential problem, but I don’t see anything that looks like a potential solution in the QACI table of contents.
Katja Grace notes that image synthesis methods have no trouble generating photorealistic human faces.
They’re terrible at hands though (which has ruined many otherwise good images for me). That post used Stable Diffusion 1.5, but even the latest SD 3.0 (with versions 2.0, 2.1, XL, Stable Cascade in between) is still terrible at it.
Don’t really know how relevant this is to your point/question about fragility of human values, but thought I’d mention it since it seems plausibly as relevant as AIs being able to generate photorealistic human faces.
Adversarial examples suggest to me that by default ML systems don’t necessarily learn what we want them to learn:
They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
They don’t handle contradictory evidence in a reasonable way, i.e., giving a confident answer when high frequency features (pixel-level details) and low frequency features (overall shape) point to different answers.
The evidence from adversarial training suggests to me that AT is merely patching symptoms (e.g., making the ML system de-emphasize certain specific features) and not fixing the underlying problem. At least this is my impression from watching this video on Adversarial Robustness, specifically the chapters on Adversarial Arms Race and Unforeseen Adversaries.
Aside from this, it’s also unclear how to apply AT to your original motivation:
A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.
because in order to apply AT we need a model of what “attacks” the adversary is allowed to do (in this case the “attacker” is a superintelligence trying to optimize the universe, so we have to model it as being allowed to do anything?) and also ground-truth training labels.
For this purpose, I don’t think we can use the standard AT practice of assuming that any data point within a certain distance of a human-labeled instance, according to some metric, has the same label as that instance. Suppose we instead let the training process query humans directly for training labels (i.e., how good some situation is) on arbitrary data points. That’s slow/costly if the process isn’t very sample efficient (which modern ML isn’t), and also scary if human implementations of human values may already have adversarial examples. (The “perceptual wormholes” work and other evidence suggest that humans also aren’t 100% adversarially robust.)
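To make the "same label within a certain distance" assumption concrete, here is a minimal sketch of how a standard AT training point might be generated for a toy linear classifier. The function names, the L-infinity metric, and the linear model are all assumptions chosen for illustration, not anything from the linked video or a particular library.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def worst_case_in_ball(x, y, w, eps):
    """For a linear score w.x and label y in {-1, +1}, return the point within
    L-infinity distance eps of x that most reduces the margin y * (w.x)."""
    return [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]


def adversarial_training_point(x, y, w, eps):
    """Build one adversarial training example.

    Key assumption of standard AT: the perturbed point inherits the
    human-given label y, because it lies within eps of the labeled point x.
    """
    x_adv = worst_case_in_ball(x, y, w, eps)
    return x_adv, y  # label is copied, never re-queried from a human


def margin(x, y, w):
    """Signed margin y * (w.x); positive means correctly classified."""
    return y * sum(xi * wi for xi, wi in zip(x, w))


# Hypothetical toy numbers, purely illustrative.
x, y, w, eps = [1.0, -0.5], 1, [2.0, -1.0], 0.1
x_adv, y_adv = adversarial_training_point(x, y, w, eps)
```

The sketch only works because the attack is confined to a small ball and the label is copied from the nearby human-labeled point. Neither ingredient is available when the "attacker" can move arbitrarily far from anything a human has labeled.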
My own thinking is that we probably need to go beyond adversarial training for this, along the lines of solving metaphilosophy and then using that solution to find/fix existing adversarial examples and correctly generalize human values out of distribution.
I’m confused about how heterogeneity in data quality interacts with scaling. Surely training a LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law… This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?
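A toy calculation of the worry in the last sentence, assuming overall loss is just a token-weighted average of per-domain losses. The domain names and all numbers are invented for illustration; they are not measurements or outputs of any scaling law.

```python
def mixture_loss(fractions, losses):
    """Token-weighted average cross-entropy over data domains."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(fractions[d] * losses[d] for d in fractions)


# Hypothetical per-domain losses in nats/token (made-up numbers).
# Before: half the corpus is science, half is other web text.
before = mixture_loss(
    {"science": 0.5, "web": 0.5},
    {"science": 2.0, "web": 3.0},
)

# After: much more web text is added, so science shrinks to 20% of the corpus.
# The model improves on web text only; the science loss is unchanged.
after = mixture_loss(
    {"science": 0.2, "web": 0.8},
    {"science": 2.0, "web": 2.2},
)

print(f"overall loss before: {before:.2f}, after: {after:.2f}")
```

In this toy setup the overall cross-entropy drops even though the science-domain loss does not move at all, which is the kind of blind spot I’m asking about.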
Thank you for detailing your thoughts. Some differences for me:
1. I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
2. I’m perhaps less optimistic than you about commitment races.
3. I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, that I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
Why do you think these values are positive? I’ve been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I’m very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
If something is both a vanguard and limited, then it seemingly can’t stay a vanguard for long. I see a few different scenarios going forward:
1. We pause AI development while LLMs are still the vanguard.
2. The data limitation is overcome with something like IDA or Debate.
3. LLMs are overtaken by another AI technology, perhaps based on RL.
In terms of relative safety, it’s probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I’m not getting very optimistic about AI safety just yet.
The argument is that with 1970s tech the Soviet Union collapsed; however, with 2020 computer tech (not needing GenAI) it would not.
I note that China is still doing market economics, and nobody is trying (or even advocating, AFAIK) some very ambitious centrally planned economy using modern computers, so this seems like pure speculation? Has someone actually made a detailed argument about this, or at least has the agreement of some people with reasonable economics intuitions?
I’ve arguably lived under totalitarianism (depending on how you define it), and my parents definitely have and told me many stories about it. I think AGI increases risk of totalitarianism, and support a pause in part to have more time to figure out how to make the AI transition go well in that regard.
Even if someone made a discovery decades earlier than it otherwise would have been, the long term consequences of that may be small or unpredictable. If your goal is to “achieve high counterfactual impact in your own research” (presumably predictably positive ones) you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I’m a bit confused why you’re asking people to think in the direction outlined in the OP.
Some of my considerations for college choice for my kid, that I suspect others may also want to think more about or discuss:
status/signaling benefits for the parents (This is probably a major consideration for many parents to push their kids into elite schools. How much do you endorse it?)
sex ratio at the school and its effect on the local “dating culture”
political/ideological indoctrination by professors/peers
workload (having more/less time/energy to pursue one’s own interests)
I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?
I mostly offer this in the spirit of “here’s the only way I can see to reconcile subjective anticipation with UDT at all”, not “here’s something which makes any sense mechanistically or which I can justify on intuitive grounds”.
Ah I see. I think this is incomplete even for that purpose, because “subjective anticipation” to me also includes “I currently see X, what should I expect to see in the future?” and not just “What should I expect to see, unconditionally?” (See the link earlier about UDASSA not dealing with subjective anticipation.)
ETA: Currently I’m basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, am confused about conditional subjective anticipation as well as how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...
As you would expect, I strongly favor (1) over (2) over (3), with (3) being far, far worse for ‘eating your whole childhood’ reasons.
Is this actually true? China has (1) (affirmative action via “Express and objective (i.e., points and quotas)”) for its minorities and different regions and FWICT the college admissions “eating your whole childhood” problem over there is way worse. Of course that could be despite (1) not because of it, but does make me question whether (3) (“Implied and subjective (‘we look at the whole person’).”) is actually far worse than (1) for this.
Intuitively this feels super weird and unjustified, but it does make the “prediction” that we’d find ourselves in a place with high marginal utility of money, as we currently do.
This is particularly weird because your indexical probability then depends on what kind of bet you’re offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me… (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?)
By “acausal games” do you mean a generalization of acausal trade?
Yes, didn’t want to just say “acausal trade” in case threats/war is also a big thing.
This was all kinda rambly but I think I can summarize it as “Isn’t it weird that ADT tells us that we should act as if we’ll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don’t have a story for why these things are related but it does seem like a suspicious coincidence.”
I’m not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I’m assuming ADT is similar) is “Don’t think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over.”
This seems to “work” but anthropics still feels mysterious, i.e., we want an explanation of “why are we who we are / where we’re at” and it’s unsatisfying to “just don’t think about it”. UDASSA does give an explanation of that (but is also unsatisfying because it doesn’t deal with anticipations, and also is disconnected from decision theory).
I would say that under UDASSA, it’s perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).
It occurs to me that many alternatives you mention are also superstimuli:
Reading a book
Pretty unlikely or rare to encounter stories or ideas with this much information content or entertainment value in the ancestral environment.
Some people do get addicted to books, e.g., romance novels.
Extroversion / talking to attractive people
We have access to more people, including more attractive people, but talking to anyone is less likely to lead to anything consequential because of birth control and because they also have way more choices.
Sex addiction. People who party all the time.
Creativity
We have the time and opportunity to do a lot more things that feel “creative” or “meaningful” to us, but these activities have less real-world significance than such feelings might suggest because other people have way more creative products/personalities to choose from.
Struggling artists/entertainers who refuse to give up their passions. Obscure hobbies.
Not sure if there are exceptions or not, but it seems like everything we could do for fun these days is some kind of supernormal stimulus, or the “fun” isn’t much related to the original evolutionary purpose anymore. This includes e.g. forum participation. So far I haven’t tried to make great efforts to quit anything, and instead have just eventually gotten bored of certain things I used to be “addicted” to (e.g., CRPGs, micro-optimizing crypto code). (This is not meant to be advice for other people. Also the overall issue of superstimuli/addiction is perhaps more worrying to me than this comment might suggest.)
Does anyone know why security amplification and meta-execution are rarely talked about these days? I did a search on LW and found just 1 passing reference to either phrase in the last 3 years. Is the problem not considered an important problem anymore? The problem is too hard and no one has new ideas? There are too many alignment problems/approaches to work on and not enough researchers?
If you think there’s something mysterious or unknown about what happens when you make two copies of yourself
Eliezer talked about some puzzles related to copying and anticipation in The Anthropic Trilemma that still seem quite mysterious to me. See also my comment on that post.
I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
It’s confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here’s one post where someone said, “It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action.”