I’m a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning and recently became very interested in AI alignment. https://langleon.github.io/
Leon Lang
That might be worth mentioning, as I wondered about the same thing. (I didn’t realize until now that all the slope curves start at the same point on the left-hand side of the figure.)
In older texts on AI alignment, there seems to be quite a lot of discussion of how to learn human values, like here:
https://ai-alignment.com/the-easy-goal-inference-problem-is-still-hard-fad030e0a876
My impression is that nowadays, the alignment problem seems more focused on something I would describe as “teach the AI to follow any goal at all”, as if the goal we should align the AI with doesn’t matter as much from a research perspective.
Could someone provide some insights into the reasons for this? Or are my impressions wrong and I hallucinated the shift?
I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:
It seems unlikely that comments on LessWrong speed up capabilities research, since the thoughts are probably just a subset of what the scaling teams know, and LessWrong is likely not their highest-signal information source anyway.
Even from a safety perspective, it seems important to know which problems in capabilities research can be alleviated, since this will give a clearer picture of timelines.
I think we should have strong reasons before discouraging topics of discussion, since LessWrong is not only a place for instrumental rationality but also for epistemic rationality—maybe even more so.
That said, LessWrong is de facto one of the best places to discuss AI safety since the Alignment Forum is invite-only. Thus, it seems that there should be some discussion around which tradeoffs to make on LW between “figuring out what’s true” and “not spreading info hazards”.
Thanks for your answer!
In the worlds where there’s not much future risk of a LWer someday posting a dangerous capabilities insight, there’s also less future benefit to LW posts, since we’re probably not generating many useful ideas in general (especially about AGI and AGI alignment).
This seems correct, though it’s still valuable to flesh out that it seems possible to have LW posts that are helpful for alignment but not for capabilities: namely, posts that summarize insights from capabilities research that are known to ~all capabilities people but to few alignment people.
The main reason I shifted more to your viewpoint now is that capabilities insights might influence people who do not yet know a lot about capabilities to work on that in the future, instead of working on alignment. Therefore, I’m also not sure if Marius’ heuristic “Has company-X-who-cares-mostly-about-capabilities likely thought about this already?” for deciding whether something is infohazardy is safe.
‘We should require a high bar before we’re willing to not-post potentially-world-destroying information to LW, because LW has a strong commitment to epistemic rationality’ seems like an obviously terrible argument to me. People should not post stuff to the public Internet that destroys the world just because the place they’re posting is a website that cares about Bayesianism and belief accuracy.
Yes, that seems correct (though I’m a bit unhappy about you bluntly straw-manning my position). I think after reflection I would phrase my point as follows:
“There is a conflict between LessWrong’s commitment to epistemic rationality on the one hand and its commitment to restricting info hazards on the other. LessWrong’s commitment to epistemic rationality exists for good reasons and should not be given up lightly. Therefore, whenever we restrict discussion and information about certain topics, we should have thought about this with great care.”
I don’t yet have a fleshed-out view on this, but I did move a bit in Tom’s direction.
So far, the summaries are only “tested” by people who have worked through the whole curriculum themselves. They used the summaries to check their understanding of the articles and contrast their view with mine.
So I’m not yet confident that someone could just read my summaries without at the same time going through the full articles, but it seems worth a try.
Thank you for writing this!
Can I write a “proof” on why we shouldn’t rely on human feedback?
Possibly such a proof exists. With more assumptions, you can get better information on human values, see here.
This obviously doesn’t solve all concerns.
Iterated Distillation-Amplification seems pointless though cause the humans need to scale with the AGI
Can you elaborate on that point?
Should I write a list of bad assumptions people keep making in alignment work? [...] that suffering is a relevant risk from AGI (suffering is inefficient, it’s an anti-convergent goal)
Only a few people think about this a lot—I currently can only think of the Center on Long-Term Risk at the intersection of suffering focus and AI safety. Given how bad suffering is, I’m glad that there are people thinking about it, and I do not think that a simple inefficiency argument is enough.
Ethical constraints of fiddling with brains [...] We could solve this if we could fully simulate the human brain...
I hope I don’t misrepresent you by putting these two quotes together. Is your position that the ethical dilemmas of “fiddling with human brains” would be solved by, instead, just fiddling with simulated brains? If so, then I disagree: I think simulated brains are also moral patients, to the same degree that physical brains are. I like this fiction a lot.
But essentially, I suspect generating suffering as a subgoal for an AGI is something like an anti-convergent goal: It makes almost all other goals harder to achieve.
I think I basically agree (though maybe not with as high confidence as you), but I think that doesn’t mean that huge amounts of suffering will not dominate the future. For example, if there are not one but many superintelligent AI systems determining the future, this might create suffering due to cooperation failures.
This is my first short form. It doesn’t have any content, I just want to test the functionality.
This is my first comment on my own, i.e., Leon Lang’s, shortform. It doesn’t have any content, I just want to test the functionality.
Yes, it seems like both creating a “New Shortform” when hovering over my user name and commenting on “Leon Lang’s Shortform” will do the exact same thing. But I can also reply to the comments.
Yes, after reflection I think this is correct. I think I had in mind a situation where, with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to give stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn’t keep up with the fast pace of development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.
To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.
I assume you would agree with the following rephrasing of your last sentence:
The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI.
If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our “intentions”. E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.
If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations “track reality” in the relevant way.
Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always “relative” to the AI’s physical capabilities?
Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that they don’t notice at the time, but that seems obvious to us humans abstractly reasoning about the hypothetical.
Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
Neither of your interpretations is what I was trying to say. It seems I didn’t express myself well enough.
What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible since no physically realizable reward function that is defined solely based on observations rewards only actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations that are consistent with the actions of a benevolent AI. These rewarded actions may come from a misaligned AI.
However, I notice people seem to use the terms outer and inner alignment a lot, and quite a few people seem to try to solve alignment by solving outer and inner alignment separately. So I was wondering whether they use a more refined notion of what outer alignment means, possibly one taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere.
Thanks a lot for these pointers!
Does this contest still run, given that the FTX Future Fund doesn’t exist anymore?
How, if at all, does your alignment approach deal with deceptive alignment?
Did you accidentally forget to add this post to your research journal sequence?
Here are my quick reactions to many of the points in the post:
optimization algorithms (finitely terminating)
iterative methods (convergent)
That sounds as if they are always finitely terminating or convergent, which they’re not. (I don’t think you wanted to say they are.)
Computational optimization can learn but cannot be divided. It can compute all computable functions (Turing machine or a human with pen/paper). However, if you break up the cognitive processing parts, no computation will take place.
I don’t quite understand this. What does the sentence “computational optimization can compute all computable functions” mean? Additionally, in my conception of “computational optimization” (which is admittedly rather vague), learning need not take place.
The structure of deep learning mimics the structure of intelligence as path finding through world states
I find these analogies and your explanations a bit vague. What makes it hard for me to judge what’s behind these analogies:
You write “Intelligence = Mapping current world state to target world state (or target direction)”:
these two options are conceptually quite different and might influence the meaning of the analogy. If intelligence computes only a “target direction”, then this corresponds to a heuristic approach in which locally, the correct direction in action space is chosen. However, if you view intelligence as an actual optimization algorithm, then what’s chosen is not only a direction but a whole path.
Further nitpick: I wouldn’t use the verb “to map” here. I think you mean more something like “to transform”, especially if you mean the optimization viewpoint.
You write “Learning consists of setting the right weights between all the neurons in all the layers. This is analogous to my understanding of human intelligence as path-finding through reality”
Learning is a thing you do once, and then you use the resulting neural network repeatedly. In contrast, if you search for a path, you usually use that path only once.
The output of a neural network can be a found path itself. That makes the analogy even more difficult for me.
Is human imagination and “thinking through different ways past events might have gone” a form of data augmentation? We perturb a memory and then project out how we would have felt and what we would have wanted to do. This seems quite similar to using simulation to generate and improve predictions.
Off-policy reinforcement learning is built on this idea. One famous example is DQN, which uses experience replay. The paper is still worth reading today; some consider it the start of deep RL.
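To make the mechanism concrete, here is a minimal sketch of experience replay (my own illustration, not the full DQN algorithm; the class and method names are made up for this example):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions and sample them later,
    possibly long after the behavior policy that generated them has changed."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which is one reason DQN trains stably.
        return random.sample(self.buffer, batch_size)

# Usage sketch: during interaction, call buffer.add(...) every step;
# during learning, draw a batch and apply an off-policy Q-learning update to it.
```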
Our utility function is encoded in our neural activity.
I think the terms “our utility function” and “encoded” are not well-defined enough to be able to outright say whether this is true or not, but under a reasonable interpretation of the terms, it seems correct to me.
An aligned AGI is one that has learned the function that maps our neurally encoded utility function to observable world states.
I do not know what you mean by “mapping a utility function to world states”. Is the following a correct paraphrasing of what you mean?
“An aligned AGI is one that tries to steer toward world states such that the neurally encoded utility function, if queried, would say ‘these states are rather optimal’ ”
Thus the most reliable signal of the human utility function is either:
Aggregation over a large enough sample that all the noise is cancelled out
Direct biological measures of our utility function
However, who says our behavior contains no systematic biases and errors, i.e., errors that do not cancel out over large samples?
There are indeed biases in our decision-making that mean that the utility function cannot be inferred from our behavior alone, as shown in “Humans can be assigned any values whatsoever”.
I also don’t think it’s feasible to directly measure our utility function. In my own view, our utility function isn’t an observable thing. There might be a utility function that gets revealed by running history far into the future and observing what humans converge on, but I don’t think the end result of what we value can be directly measured in our brains.
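To illustrate why aggregation helps with noise but not with systematic bias, here is a toy simulation (entirely my own, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0       # what the person "really" values (hypothetical)
systematic_bias = 0.3  # a consistent bias in their observed behavior
noise_scale = 2.0      # unsystematic, zero-mean error in each observation

for n in [10, 1_000, 100_000]:
    observed = true_value + systematic_bias + rng.normal(0.0, noise_scale, size=n)
    print(n, round(observed.mean(), 3))

# The sample mean converges to true_value + systematic_bias (about 1.3),
# not to true_value: averaging cancels the noise but not the bias.
```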
Specifically, humans gain utility directly from various stimuli and observations like eating sweet food or looking at puppies.
We cannot gain “utility”. We can only gain “reward”. Utility is a measure of world states, whereas reward is a thing happening in our brains.
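As a toy sketch of the type distinction I mean (my own illustration, with made-up names): utility is a function of the world state itself, while reward is computed from observations inside the agent, so the two can come apart when observations are corrupted:

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    cake_exists: bool       # what is actually true in the world
    photo_shows_cake: bool  # what the agent's camera reports

def utility(state: WorldState) -> float:
    """Utility in my sense: a measure of the world state itself."""
    return 1.0 if state.cake_exists else 0.0

def reward(state: WorldState) -> float:
    """Reward in my sense: a signal computed from observations only."""
    return 1.0 if state.photo_shows_cake else 0.0

# A faked photo yields reward without utility:
faked = WorldState(cake_exists=False, photo_shows_cake=True)
print(utility(faked), reward(faked))  # 0.0 1.0
```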
Instead, we have many (scarcely known) hyperparameters where the utility we get from our observations comes from the transformation and evaluation of one or many sets of observations. For instance, the satisfaction of a job well-done relies on observing the entire process and then evaluating the end result as good. Similarly, many observations that consist of directly negative stimuli (parameters) are evaluated as positive by some hyperparameter such as the meaningfulness of childbirth or the beautiful release of a funeral.
I don’t quite understand the analogy to hyperparameters here. To me, it seems like childbirth’s meaning is in itself a reward that, by credit assignment, leads to a positive evaluation of the actions that led to it, even though the in-the-moment reward was mostly negative. It is indeed interesting to figure out what exactly is going on here (the shard theory of human values might be an interesting frame for that; see also this interesting post on how the same external events can trigger different value updates), but I don’t yet see how it connects to hyperparameters.
but it’s much less clear how we’d map the ever-fickle hyperparameters of our utility function that entirely hinge on our evaluations and transformations we ourselves apply to our experiences … it’s a value we compute internally that would require the AGI to simulate us as full-bodied beings to get the exact same result.
What if instead of trying to build an AI that tries to decode our brain’s utility function, we build the process that created our values in the first place and expose the AI to this process?
The counterargument would be that language models lack grounding in reality.
The distinction between vision and language models breaks down with things like vision transformers. But in general, the lack of grounding of pure language models seems like a problem to me for reaching AGI with them. However, a language model that interacts with the world through, e.g., an internet connection might already get rid of this grounding problem.
Self-supervised learning is the default form of learning for individual agents embedded in reality.
That seems pretty plausible to me for achieving AGI, but many RL agents do not have an explicit self-supervised component.
Feature engineering seems like a form of pre-processing, and thus not a relevant concept for AGI? We’d expect AGI to learn its own features. Which is what kernels in convolutional neural networks do, for instance.
I mostly agree. But note that feature engineering is just a form of inductive prior, and it’s not possible to get rid of inductive priors entirely and have everything be “learned”—there is no free lunch.
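To make the inductive-prior point concrete, here is a small sketch (my own, not from the post): a hand-engineered edge filter and a randomly initialized “learnable” kernel both rely on the same architectural prior of local, translation-equivariant filtering; only the origin of the kernel values differs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-engineered feature: a fixed Sobel kernel for vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# "Learned" feature: a kernel whose entries are free parameters, initialized
# randomly and (in a real network) updated by gradient descent during training.
learned_kernel = rng.normal(scale=0.1, size=(3, 3))

def convolve2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Minimal 'valid' 2D cross-correlation, enough for this illustration."""
    h, w = kernel.shape
    out = np.empty((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = rng.normal(size=(8, 8))
# Both paths apply the same convolutional prior; neither is prior-free.
print(convolve2d_valid(image, sobel_x).shape)         # (6, 6)
print(convolve2d_valid(image, learned_kernel).shape)  # (6, 6)
```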
Overfitting in a neural network is basically memorizing the data set.
Many models that do not overfit also memorize much of the data set.
I might have overloaded the phrase “computational” here. My intention was to point out what can be encoded by such a system. Maybe “coding” is a better word? E.g., neural coding. These systems can implement Turing machines so can potentially have the same properties of turing machines.
I see. I think I was confused since, in my mind, there are many Turing machines that simply do not “optimize” anything. They just compute a function.
I’m wondering if our disagreement is conceptual or semantic. Optimizing a direction instead of an entire path is just a difference in time horizon in my model. But maybe this is a different use of the word “optimize”?
I think I wanted to point to a difference in the computational approach of different algorithms that find a path through the universe. If you chain together many locally found heuristics, then you carve out a path through reality over time that may lead to some “desirable outcome”. But the computation would be vastly different from another algorithm that thinks about the end result and then makes a whole plan of how to reach this. It’s basically the difference between deontology and consequentialism. This post is on similar themes.
I’m not at all sure if we disagree about anything here, though.
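To make the distinction concrete, here is a toy sketch (entirely my own illustration): a “direction” algorithm that chains local heuristic steps, versus a “path” algorithm that computes the whole plan before acting:

```python
from collections import deque

# Toy grid world: move from (0, 0) to GOAL on a 5x5 open grid.
GOAL = (4, 4)
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def greedy_direction(pos):
    """Heuristic approach: at each step, pick the neighbor that locally
    decreases the Manhattan distance to the goal. No global plan is ever formed."""
    def dist(p):
        return abs(p[0] - GOAL[0]) + abs(p[1] - GOAL[1])
    return min(((pos[0] + dx, pos[1] + dy) for dx, dy in MOVES), key=dist)

def plan_full_path(start):
    """Planning approach: compute the entire path (here via breadth-first
    search) before taking a single step."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        pos = frontier.popleft()
        if pos == GOAL:
            path = []
            while pos is not None:
                path.append(pos)
                pos = parents[pos]
            return path[::-1]
        for dx, dy in MOVES:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in parents and 0 <= nxt[0] <= 4 and 0 <= nxt[1] <= 4:
                parents[nxt] = pos
                frontier.append(nxt)

# Chaining local heuristics carves out a path over time ...
pos, trace = (0, 0), [(0, 0)]
while pos != GOAL:
    pos = greedy_direction(pos)
    trace.append(pos)
print(trace)
# ... whereas the planner holds the whole path before acting:
print(plan_full_path((0, 0)))
```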
If I learn the optimal path to work, then I can use that multiple times. I’m not sure I agree with the distinction you are drawing here … Some problems in life only need to be solved exactly once, but that’s the same as any thing you learn only being applicable once.
I would say that if you remember the plan and retrieve it later for repeated use, then you do this by learning, and the resulting computation is not planning anymore. Planning is always the thing you do in the moment to find good results now, and learning is the thing you do to be able to use a solution repeatedly.
Part of my opinion also comes from the intuition that planning derives its usefulness from being applied in complex environments in which learning by heart is often useless. The very reason why planning is useful for intelligent agents is that they cannot simply learn heuristics to navigate the world.
To be fair, it might be that I don’t have the same intuitive connection between planning and learning in my head that you do, so if my comments are beside the point, then feel free to ignore :)
A hyperparameter is a parameter across parameters. So say with childbirth, you have a parameter pain on physical pain which is a direct physical signal, and you have a hyperparameter ‘Satisfaction from hard work’ that takes ‘pain’ as input as well as some evaluative cognitive process and outputs reward accordingly. Does that make sense?
Conceptually it does, thank you! I wouldn’t call these parameters and hyperparameters, though. Low-level and high-level features might be better terms.
Again, I think the shard theory of human values might be an inspiration for these thoughts, as well as this post on AGI motivation which talks about how valence gets “painted” on thoughts in the world model of a brain-like AGI.
Is this on the sweet spot just before overfitting or should I be thinking of something else?
I personally don’t have good models for this. Ilya Sutskever mentioned in a podcast that under some models of Bayesian updating, learning by heart is optimal and a component of perfect generalization. Also, from personal experience, I think that people who generalize very well often have lots of knowledge, though this may be confounded by other effects.
Is there anything precise known about the distribution over the severity of symptoms I should expect as a 20–30-year-old? I’m in that age group, which is why I’m interested.
So, what I’d like to know specifically, conditional on being infected, is:
How likely would I be asymptomatic?
How likely would I have symptoms not more severe than the common cold?
How likely would I have symptoms comparable in severity to the flu (being mostly in bed for maybe 2 weeks but nothing more)?
How likely would I have mild to moderate pneumonia with which I could still stay at home?
How likely would I need to go into the hospital and receive oxygen, but no mechanical ventilation?
How likely would I need mechanical ventilation?
How likely is it I might die even if I receive mechanical ventilation?