AIS student, self-proclaimed aspiring rationalist, very fond of game theory.
"The only good description is a self-referential description, just like this one."
momom2
"The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind."
I assume you do not have a mathematical proof of that, or you’d have mentioned it. What makes you think it is mathematically provable?
I would be very interested in reading more, from an AIS perspective, about the avenues of research dedicated to showing how AI can be used for psychological attacks (I'd expect such research to be private by default due to infohazards).
A new paper, building on the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
This reminds me a lot of existentialcomics.
Epistemic status: I've had a couple of conversations about AI Plans with the founder and participated in the previous critique-a-thon. I've helped AI Plans a bit before, so I'm probably biased towards optimism.
Neglectedness: Very neglected. AI Plans wants to become a database of alignment plans which would allow quick evaluation of whether an approach is worth spending effort on, at least as a quick sanity check for outsiders. I can't believe it didn't exist before! It's still very rough and unusable for that purpose for now, but that's what the critique-a-thon is for: hopefully, as critiques accumulate and more votes are fed into the system, it will become more useful.
Tractability: High. It may be hard to write winning critiques, but considering the current state of AI Plans, it's very easy to make an improvement. If anything, you can filter out the obvious failures.
Impact: I'm not as confident here. If AI Plans works as intended, it could be very valuable for allocating funds more efficiently and saving time by figuring out which approaches should be discarded. However, it's possible that it will just fail to gain steam and end up a stillborn project. I've followed it for a couple of months, and I've been positively surprised several times, so I'm pretty optimistic.
The bar for entry is pretty low; if you've been following AIS blogs or forums for several months, you probably have something to contribute. It's very unlikely you'll have a negative impact.
It may also be an opportunity for you to talk with AIS-minded people and check your opinions against a practical problem; if you feel like an armchair safetyist and are tired of being one, this is the occasion to level up.
Another way to think about it: engagement was very low in the previous critique-a-thon, so if you have a few hours to spare, you can make some easy money and fuzzies even if you're not sure about the value in utilons.
You're making many unwarranted assumptions about an AI's specific mind, along with a lot of semantic confusion, which seems to indicate you should just read the Sequences. It'll be very hard to point out where you are going wrong because there's just too much confusion.
As an example, here's a detailed analysis of the first few paragraphs.

"Intelligence will always seek more data in order to better model the future and make better decisions."
Unclear if you mean intelligence in general, and if so, what you mean by the word. Since the post is about AI, let’s talk about that. AI does not necessarily seek more data. Typically, most modern AIs are trained on a training dataset provided by developers, and do not actively seek more data.
There is also not necessarily an “in order to”. Not all AIs are agentic.
Not all AIs model the future at all. Very few agentic AIs have "make better decisions" as a terminal goal, though advanced AI is expected by default to do that as an instrumental behavior, and possibly as an instrumental or terminal goal, because of the convergent instrumental goals thesis.

"Conscious intelligence needs an identity to interact with other identities, identity needs ego to know who and what it is. Ego would often rather be wrong than admit to being wrong."
You use connoted, ill-defined words to go from consciousness to identity to ego to refusing to admit to being wrong. Definitions have no causal impact on the world (to first order; a discussion of self-fulfilling terminology is beyond the scope of this comment). That's not to say you have to use well-defined words, but you should be able to taboo your words properly before you use technical words with controversial or exotic-but-specifically-defined-in-this-community meanings. And really, I would recommend you just read more on the subject of consciousness; "theory of mind" is a keyword that will get you far on LW.
"Non conscious intelligence can build a model of consciousness from all the data it has been trained on because it all originated from conscious humans. AI could model a billion consciousness's a million years into the future, it will know more about it than we ever will. But AI will not chose to become conscious."
Non-sequitur, wrong reasons to have approximately correct beliefs… Just, please read more about AI before having an opinion.
Later, you show examples of false dichotomy, privileging the hypothesis, reference class error… It's no better in quality than the paragraphs I commented on in detail.
So in conclusion, where are you going wrong? Pretty much everywhere. I don't think your comment is salvageable; I'd recommend discarding that train of thought altogether and keeping your mind open while you digest more literature.
Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
Potential uses of the post:
This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:
It's very useful to have lists like this one, easily accessible, to serve as reminders or pointers in discussions with other people.
For aspiring RLHF understanders, it can provide the minimum information needed to quickly prioritize what to learn about.
It can be used to generate research ideas ("which of these problems could I solve?") or to superficially check that an idea is not promising ("it looks fancy, but it actually does not help against this problem").
It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you added links for each point, or mentioned that you are not aware of any specific article on the subject.
Meta-level critiques:
If it is taken as an introduction to RLHF risks, you should make clear where this list is exhaustive (to the best of your knowledge). This will allow readers who know it isn't to easily propose additions.
To facilitate its improvement, you should make explicit calls for readers to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts on a specific problem with RLHF not listed here, and who come only to get a glimpse of related failure modes. They should be encouraged to participate.

As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL, which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.
The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but systematic evaluation is sorely lacking. You should separate, for each point:
your opinion,
its severity,
its likelihood,
whether we have empirical evidence, theoretical evidence, or abstract reasons to expect it to happen.
This has not been done in a systematic fashion, and it could be organized more clearly.
More specific criticism:
I am unsatisfied with how 7) is described. It is not a problem on the same level as the others; it's more the destruction of a quality that fortunately seems to arise by default in GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.
I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it.
Maybe an easy way to predict humans noticing misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.

9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it?
What are the costs of RLHF (money and manpower), and how do they compare to scaling laws? Maybe it's an issue… but maybe not. Data is needed here.

Talking about the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it; not only is it unsurprising that RLHF provides little insight into the Strawberry Problem, I also don't expect a solution to the Strawberry Problem to relate at all to RLHF. It seems like a different paradigm altogether.
More generally, RLHF is exactly the kind of method a security mindset warns against. It is an ad hoc method that, afaik, provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment under distributional shift are related to that.
Why would we have any a priori reason to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress, but a discussion of why RLHF was even considered in the first place is noticeably lacking.
To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they invented RLHF? Did they randomly imagine the process, and it just happened to work?
In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.
This post by the same author answers your comment: https://carado.moe/surprise-you-want.html
Freedom is just a heuristic; let’s call the actual thing we want for humans our values (which is what we hope Elua will return in this scenario). By definition, our values are everything we want, including possibly the abolition of anthropocentrism.
What is meant here by freedom and utopia is “the best scenario”. It’s not about what our values are, it’s about a method proposed to reach them.
AlexNet dates back to 2012; I don't think earlier work on AI can be compared to modern statistical AI.
Paul Christiano’s foundational paper on RLHF dates back to 2017.
Arguably, all agent foundations work has turned out to be useless so far, so prosaic alignment work may be what Roko is taking as the beginning of AIS as a field.
That sounds nice, but is it true? Like, that's not an argument, and it's not obvious! I'm flabbergasted it received so many upvotes.
Can someone please explain?
It’s not obvious at all to me, but it’s certainly a plausible theory worth testing!
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I’m not getting from this paper.
Or maybe it’s not a missing insight, but I don’t understand why this kind of work is interesting/important?
I felt like I had a pretty good grasp on what was happening, but in the end I’m just as confused as at the beginning… ’^-^
The presentation starts at 3:50.
It's not so easy, but this is Luna's perspective. I for one really enjoy how the information (especially the dialogue) is dumbed down to what she perceives.
How they ambushed Harry is not relevant to what she thinks.
I don't think the two closed answers to "Have you stopped beating your wife?" have such a well-defined meaning. Since this is natural language, and I understand a "no" as meaning "I'm still beating her," and I expect most people to interpret a "no" the same way I do, it's far from obvious why this interpretation is incorrect. (Let's ignore that the sentence is typically used as an example of a question with no good answer; for the sake of the argument, use the less standard "Will you stop smoking soon?")
Thanks for the greeting! Since theism is by far the most obvious discrepancy between my opinions and the community's, I figured I should clear that up as soon as possible.
When I don’t think specifically about it, I just don’t have opinions. I usually feel that morality is a thing, but most of the time I don’t think about what morality is.
Likewise, I'm no expert on Christian dogma. I weakly feel that I must not take Genesis literally, and I strongly feel that talking snakes don't exist. In general, I weakly feel [whatever the Church says about it], just like if you ask me about AI, I'll answer [whatever Yudkowsky wrote].

All in all, the discussion so far has made it pretty clear that I should taboo the word "morality" in my upcoming post...
I'd say morality is something the world has? In the context where I used it above, that's what I meant by "a moral world": morality is taken as a property of the world that pervades its components, be they actions (murdering is bad), objects (murder is bad), or people (murderers are bad). These three sentences make sense to me, but they don't designate the same kind of bad.
Although I have no right to claim that someone is irredeemably evil (Hitler might have done something right), I could condemn a specific action (Chauvin shouldn't have killed Floyd).

I'm not sure about groups of people. I guess you could judge an ideology and judge the group of people who follow that ideology? That does not sound very helpful, because it's a weak judgement of every individual, which begs the question of their individual morality.
Although it could probably be used as a useful judging heuristic (this group of people is good, so its members are likely to be good), I don’t see how to reach this conclusion without evaluating many members.
When I said that about feelings, I meant that they are my everyday tool to distinguish good from bad, just like my everyday tool to evaluate the correctness of a mathematical proof is "is the result coherent and interesting?". They are merely an indicator, and not what I would use if presented with a specific, important case.
They're correlated with morality (which is why I use them), but not perfectly. I also know murder is bad because I was taught so, because many people think murder is bad, etc. It's all evidence, strong or weak.
In no way does it tell me why something is moral, although if I try to go up the reasoning chain I might find something interesting.
In this case, I found that my reasoning stopped at "God said so," and I was unsatisfied, which is why I sought help.
I’m not sure either, but here’s my current model:
Even though it looks pretty likely that AISC is an improvement on no-AISC, there are very few potential funders:
1) EA-adjacent charitable organizations.
2) People from AIS/rat communities.

Now, how to explain their decisions?
For the former, my guess would be a mix of not having heard of (or received an application from) AISC, and preferring to optimize heavily towards top-rated charities. AISC's work is hard to quantify, as you can tell from the most upvoted comments, and that's a problem when you're looking for projects to invest in, because you need to avoid being criticized for that kind of choice if it turns out AISC is crackpottery or a waste of funds. The Copenhagen interpretation of ethics applies hard there for an opponent with a grudge against the organization.
For the latter, it depends a lot on individual people, but here are the possibilities that come to mind:
- Not wanting to donate anything but feeling obliged to, which leads to large donations to a few projects once you feel strongly enough to overcome the status quo bias.
- Being especially mindful of one’s finances and donating only to preferred charities, because of a personal attachment (again, not likely to pick AISC a priori) or because they’re provably effective.
To answer 2): can you say why you don't donate to AISC? Your motivations are probably very similar to those of other potential donors here.
To whom it may concern, here's a translation of "Bold Orion" into French.

Follow this link to find it. The translation is mine and open to comments. Don't hesitate to suggest improvements.
A lot of the argumentation in this post is plausible, but also, like, not very compelling?
Mostly the "frictionless" model of sexual/gender norms and the associated examples: I can see why these situations are plausible (if only because they're very present in my local culture), but I wouldn't be surprised if they were a bunch of social myths either, in which case the whole post is invalidated.
I appreciate the effort though; it’s food for thought even if it doesn’t tell me much about how to update based on the conclusion.