AIS student, self-proclaimed aspiring rationalist, very fond of game theory.
“The only good description is a self-referential description, just like this one.”
momom2
Cheat sheet of AI X-risk
[Question] What criterion would you use to select companies likely to cause AI doom?
The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.
I assume you do not have a mathematical proof of that, or you’d have mentioned it. What makes you think it is mathematically provable?
I would be very interested in reading more about the avenues of research dedicated to showing how AI can be used for psychological attacks from the perspective of AIS (I’d expect such research to be private by default due to infohazards).
A new paper, built upon the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
This reminds me a lot of existentialcomics.
[Question] How can there be a godless moral world?
Epistemic status: Had a couple conversations on AI Plans with the founder, participated in the previous critique-a-thon. I’ve helped AI Plans a bit before, so I’m probably biased towards optimism.
Neglectedness: Very neglected. AI Plans wants to become a database of alignment plans which would allow quick evaluation of whether an approach is worth spending effort on, at least as a quick sanity check for outsiders. I can’t believe it didn’t exist before! It is still very rough and unusable for that purpose for now, but that’s what the critique-a-thon is for: hopefully, as critiques accumulate and more votes are fed into the system, it will become more useful.
Tractability: High. It may be hard to make winning critiques, but considering the current state of AI Plans, it’s very easy to make an improvement. If anything, you can filter out the obvious failures.
Impact: I’m not as confident here. If AI Plans works as intended, it could be very valuable to allocate funds more efficiently and save time by figuring out which approaches should be discarded. However, it’s possible that it will just fail to gain steam and become a stillborn project. I’ve followed it for a couple months, and I’ve been positively surprised several times, so I’m pretty optimistic.
The bar to entry is pretty low; if you’ve been following AIS blogs or forums for several months, you probably have something to contribute. It’s very unlikely you’ll have a negative impact.
It may also be an opportunity for you to discuss with AIS-minded people and check your opinions on a practical problem; if you feel like an armchair safetyist and are tired of being one, this is the occasion to level up.
Another way to think about it is that engagement was very low in the previous critique-a-thon, so if you have a few hours to spare, you can make some easy money and fuzzies even if you’re not sure about the value in utilons.
You’re making many unwarranted assumptions about an AI’s specific mind, along with a lot of confusion about semantics which seems to indicate you should just read the Sequences. It’ll be very hard to point out where you are going wrong because there’s just too much confusion.
As an example, here’s a detailed analysis of the first few paragraphs:

Intelligence will always seek more data in order to better model the future and make better decisions.
Unclear if you mean intelligence in general, and if so, what you mean by the word. Since the post is about AI, let’s talk about that. AI does not necessarily seek more data. Typically, most modern AIs are trained on a training dataset provided by developers, and do not actively seek more data.
There is also not necessarily an “in order to”. Not all AIs are agentic.
Not all AIs model the future at all. Very few agentic AIs have as a terminal goal to make better decisions, though it is expected that advanced AI by default will do that as an instrumental behavior, and possibly as an instrumental or terminal goal because of the convergent instrumental goals thesis.

Conscious intelligence needs an identity to interact with other identities, identity needs ego to know who and what it is. Ego would often rather be wrong than admit to being wrong.
You use connoted, ill-defined words to go from consciousness to identity to ego to refusing to admit to being wrong. Definitions have no causal impact on the world (in first-order considerations; a discussion of self-fulfilling terminology is beyond this comment). That’s not to say you have to use well-defined words, but you should be able to taboo your words properly before you use technical words with controversial/exotic-but-specifically-defined-in-this-community meaning. And really, I would recommend you just read more on the subject of consciousness; theory of mind is a keyword that will get you far on LW.
Non-conscious intelligence can build a model of consciousness from all the data it has been trained on because it all originated from conscious humans. AI could model a billion consciousnesses a million years into the future; it will know more about it than we ever will. But AI will not choose to become conscious.
Non-sequitur, wrong reasons to have approximately correct beliefs… Just, please read more about AI before having an opinion.
Later, you show examples of false dichotomy, privileging the hypothesis, reference class error… it’s not better quality than the paragraphs I commented in detail.
So in conclusion, where are you going wrong? Pretty much everywhere. I don’t think your comment is salvageable, I’d recommend just discarding that train of thought altogether and keeping your mind open while you digest more literature.
Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
Potential uses of the post:
This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:
It’s very useful to have lists like those, easily accessible to serve as reminders or pointers when you discuss with other people.
For aspiring RLHF understanders, it can provide minimum information to quickly prioritize what to learn about.
It can be used to generate ideas of research (“which of these problems could I solve?”) or superficially check that an idea is not promising (“it looks fancy, but actually it does not help against this problem”).
It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you put links for each point, or mention that you are not aware of any specific article on the subject.
Meta-level critiques:
If it is taken as an introduction to RLHF risks, you should make clear whether this list is exhaustive (to the best of your knowledge). This will allow readers who know it isn’t to easily propose additions.
To facilitate its improvement, you should make explicit calls to the reader to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts in a specific problem with RLHF not listed here, who come only to get a glimpse of related failure modes. They should be encouraged to participate.

As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL, which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.
The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is a big lack of systematic evaluation. You should separate for each point:
your opinion,
its severity,
its likelihood,
whether we have empirical or theoretical evidence, or abstract reasons to expect it to happen.
This has not been done in a systematic fashion, and it could be organized more clearly.
More specific criticism:
I am unsatisfied with how 7) is described. It is not a problem on the same level as the others, but rather the loss of a quality that fortunately seems to arise by default in GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.
I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it.
Maybe an easy way to predict humans noticing misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.

9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it?
What are the costs of RLHF (money and manpower), and how do they compare to scaling laws? Maybe it’s an issue… but maybe not. Data is needed here.

Talking about the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it; not only is it unsurprising that RLHF provides little insight into the Strawberry Problem, I also don’t expect a solution to the Strawberry Problem to relate at all to RLHF. It seems like a different paradigm altogether.
More generally, RLHF is exactly the kind of method a security mindset warns against. It is an ad hoc method that afaik provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment in case of a distributional shift are related to that.
Why would we have any a priori reason to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress, but a discussion of the reasons RLHF was even considered in the first place is noticeably lacking.
To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they invented RLHF? Did they randomly imagine the process and it happened to work?
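For readers unfamiliar with what RLHF actually does mechanically, here is a toy sketch of its reward-modeling step, the part the security-mindset critique above targets. Everything in it (the linear reward, the synthetic annotators, the feature dimensions) is my own illustrative assumption, not from the post or the paper; real RLHF fits a neural reward model from human preference labels and then optimizes a policy against it.

```python
import numpy as np

# Toy sketch: fit a reward function r(x) from pairwise preferences using the
# Bradley-Terry model, P(a preferred over b) = sigmoid(r(a) - r(b)).
# Here r is linear in hand-made features; this is a simplification of the
# neural reward models used in practice.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical "true" reward the annotators implicitly use (noiseless labels).
true_w = np.array([2.0, -1.0, 0.5])

# Synthetic preference pairs: features of options a and b, label 1 if a wins.
X_a = rng.normal(size=(500, 3))
X_b = rng.normal(size=(500, 3))
prefs = (X_a @ true_w > X_b @ true_w).astype(float)

# Fit w by gradient descent on the logistic (Bradley-Terry) loss.
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    p = sigmoid((X_a - X_b) @ w)                       # predicted P(a > b)
    grad = (X_a - X_b).T @ (p - prefs) / len(prefs)    # cross-entropy gradient
    w -= lr * grad

# The learned reward should rank the training pairs mostly like the annotators.
acc = np.mean(((X_a - X_b) @ w > 0) == prefs.astype(bool))
print(f"pairwise agreement: {acc:.2f}")
```

Note that even when the fit succeeds on the training pairs, nothing in this procedure constrains behavior off-distribution, which is the distributional-shift worry above.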
In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.
This post by the same author answers your comment: https://carado.moe/surprise-you-want.html
Freedom is just a heuristic; let’s call the actual thing we want for humans our values (which is what we hope Elua will return in this scenario). By definition, our values are everything we want, including possibly the abolition of anthropocentrism.
What is meant here by freedom and utopia is “the best scenario”. It’s not about what our values are, it’s about a method proposed to reach them.
AlexNet dates back to 2012; I don’t think previous work on AI can be compared to modern statistical AI.
Paul Christiano’s foundational paper on RLHF dates back to 2017.
Arguably, all agent foundations work has turned out to be useless so far, so prosaic alignment work may be what Roko is taking as the beginning of AIS as a field.
That sounds nice but is it true? Like, that’s not an argument, and it’s not obvious! I’m flabbergasted it received so many upvotes.
Can someone please explain?
[Question] Do you like excessive sugar?
It’s not obvious at all to me, but it’s certainly a plausible theory worth testing!
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I’m not getting from this paper.
Or maybe it’s not a missing insight, but I don’t understand why this kind of work is interesting/important?
I felt like I had a pretty good grasp on what was happening, but in the end I’m just as confused as at the beginning… ’^-^
The presentation starts at 3:50.
It’s not so easy, but this is the perspective of Luna. I for one really enjoy how the information (especially dialogue) is dumbed down to what she perceives.
How they ambushed Harry is not relevant to what she thinks.
I don’t think the two closed answers to “Have you stopped beating your wife?” have such a well-defined meaning. Since this is natural language, and I understand a no as meaning “I’m still beating her,” and I expect most people to interpret a no the same way as I do, it’s far from obvious why this interpretation is incorrect. (Since that sentence is typically used as an example that has no good answer, use “Will you stop smoking soon?”, which is less standard, for the sake of the argument.)
A lot of the argumentation in this post is plausible, but also, like, not very compelling?
Mostly the “frictionless” model of sexual/gender norms and the associated examples: I can see why these situations are plausible (if only because they’re very present in my local culture), but I wouldn’t be surprised if they were a bunch of social myths either, in which case the whole post is invalidated.
I appreciate the effort though; it’s food for thought even if it doesn’t tell me much about how to update based on the conclusion.