AIS student, self-proclaimed aspiring rationalist, very fond of game theory.
“The only good description is a self-referential description, just like this one.”
momom2
Cheat sheet of AI X-risk
[Question] What criterion would you use to select companies likely to cause AI doom?
The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.
I assume you do not have a mathematical proof of that, or you’d have mentioned it. What makes you think it is mathematically provable?
I would be very interested in reading more about the avenues of research dedicated to showing how AI can be used for psychological attacks from the perspective of AIS (I’d expect such research to be private by default due to infohazards).
A new paper, built upon the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
This reminds me a lot of existentialcomics.
[Question] How can there be a godless moral world?
Epistemic status: Had a couple conversations on AI Plans with the founder, participated in the previous critique-a-thon. I’ve helped AI Plans a bit before, so I’m probably biased towards optimism.
Neglectedness: Very neglected. AI Plans wants to become a database of alignment plans which would allow quick evaluation of whether an approach is worth spending effort on, at least as a quick sanity check for outsiders. I can’t believe it didn’t exist before! It is still very rough and unusable for that purpose for now, but that’s what the critique-a-thon is for: hopefully, as critiques accumulate and more votes are fed into the system, it will become more useful.
Tractability: High. It may be hard to make winning critiques, but considering the current state of AI Plans, it’s very easy to make an improvement. If anything, you can filter out the obvious failures.
Impact: I’m not as confident here. If AI Plans works as intended, it could be very valuable to allocate funds more efficiently and save time by figuring out which approaches should be discarded. However, it’s possible that it will just fail to gain steam and become a stillborn project. I’ve followed it for a couple months, and I’ve been positively surprised several times, so I’m pretty optimistic.
The bar to entry is pretty low; if you’ve been following AIS blogs or forums for several months, you probably have something to contribute. It’s very unlikely you’ll have a negative impact.
It may also be an opportunity for you to discuss with AIS-minded people and check your opinions on a practical problem; if you feel like an armchair safetyist and are tired of being one, this is the occasion to level up.
Another way to think about it is that engagement was very low in the previous critique-a-thon, so if you have a few hours to spare, you can make some easy money and fuzzies even if you’re not sure about the value in utilons.
You’re making many unwarranted assumptions about an AI’s specific mind, along with a lot of confusion about semantics which seems to indicate you should just read the Sequences. It’ll be very hard to point out where you are going wrong because there’s just too much confusion.
As an example, here’s a detailed analysis of the first few paragraphs:

Intelligence will always seek more data in order to better model the future and make better decisions.
Unclear if you mean intelligence in general, and if so, what you mean by the word. Since the post is about AI, let’s talk about that. AI does not necessarily seek more data. Typically, most modern AIs are trained on a training dataset provided by developers, and do not actively seek more data.
There is also not necessarily an “in order to”. Not all AIs are agentic.
Not all AIs model the future at all. Very few agentic AIs have as a terminal goal to make better decisions, though it is expected that advanced AI by default will do that as an instrumental behavior, and possibly as an instrumental or terminal goal because of the convergent instrumental goals thesis.

Conscious intelligence needs an identity to interact with other identities, identity needs ego to know who and what it is. Ego would often rather be wrong than admit to being wrong.
You use connoted, ill-defined words to go from consciousness to identity to ego to refusing to admit to being wrong. Definitions have no causal impact on the world (in first-order considerations; a discussion of self-fulfilling terminology is beyond this comment). That’s not to say you have to use well-defined words, but you should be able to taboo your words properly before you use technical words with controversial/exotic-but-specifically-defined-in-this-community meaning. And really, I would recommend you just read more on the subject of consciousness; theory of mind is a keyword that will get you far on LW.
Non-conscious intelligence can build a model of consciousness from all the data it has been trained on because it all originated from conscious humans. AI could model a billion consciousnesses a million years into the future; it will know more about it than we ever will. But AI will not choose to become conscious.
Non-sequitur, wrong reasons to have approximately correct beliefs… Just, please read more about AI before having an opinion.
Later, you show examples of false dichotomy, privileging the hypothesis, reference class error… it’s not better quality than the paragraphs I commented in detail.
So in conclusion, where are you going wrong? Pretty much everywhere. I don’t think your comment is salvageable, I’d recommend just discarding that train of thought altogether and keeping your mind open while you digest more literature.
Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
Potential uses of the post:
This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:
It’s very useful to have lists like those, easily accessible to serve as reminders or pointers when you discuss with other people.
For aspiring RLHF understanders, it can provide minimum information to quickly prioritize what to learn about.
It can be used to generate ideas of research (“which of these problems could I solve?”) or superficially check that an idea is not promising (“it looks fancy, but actually it does not help against this problem”).
It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you put links for each point, or mention that you are not aware of any specific article on the subject.
Meta-level critiques:
If it is taken as an introduction to RLHF risks, you should make clear whether this list is exhaustive (to the best of your knowledge). This will allow readers who know it isn’t to easily propose additions.
To facilitate its improvement, you should make explicit calls to the reader to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts in a specific problem with RLHF not listed here, who come only to get a glimpse of related failure modes. They should be encouraged to participate.

As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL, which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.
The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is a big lack of systematic evaluation. You should separate for each point:
your opinion,
its severity,
its likelihood,
whether we have empirical or theoretical evidence, or abstract reasons to expect it to happen.
This has not been done in a systematic fashion, and it could be organized more clearly.
More specific criticism:
I am unsatisfied with how 7) is described. It is not a problem on the same level as the others, but rather the loss of a quality that fortunately seems to arise by default in GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.
I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it.
Maybe an easy way to predict humans noticing misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.

9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it?
What are the costs of RLHF (money and manpower), and how do they compare to scaling laws? Maybe it’s an issue… but maybe not. Data is needed here.

Talking about the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it; not only is it unsurprising that RLHF provides little insight into the Strawberry Problem, I also don’t expect a solution to the Strawberry Problem to relate at all to RLHF. It seems like a different paradigm altogether.
More generally, RLHF is exactly the kind of method a security mindset warns against. It is an ad hoc method that afaik provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment in case of a distributional shift are related to that.
Why would we have any a priori reason to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress, but a discussion of the reasons RLHF was even considered in the first place is noticeably lacking.
To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they invented RLHF? Did they randomly imagine the process and it happened to work?
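For readers unfamiliar with what RLHF actually does mechanically, here is a toy sketch of its reward-modeling step, the part the security-mindset critique above targets. Everything in it (the linear reward, the synthetic annotators, the feature dimensions) is my own illustrative assumption, not from the post or the paper; real RLHF fits a neural reward model from human preference labels and then optimizes a policy against it.

```python
import numpy as np

# Toy sketch: fit a reward function r(x) from pairwise preferences using the
# Bradley-Terry model, P(a preferred over b) = sigmoid(r(a) - r(b)).
# Here r is linear in hand-made features; this is a simplification of the
# neural reward models used in practice.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical "true" reward the annotators implicitly use (noiseless labels).
true_w = np.array([2.0, -1.0, 0.5])

# Synthetic preference pairs: features of options a and b, label 1 if a wins.
X_a = rng.normal(size=(500, 3))
X_b = rng.normal(size=(500, 3))
prefs = (X_a @ true_w > X_b @ true_w).astype(float)

# Fit w by gradient descent on the logistic (Bradley-Terry) loss.
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    p = sigmoid((X_a - X_b) @ w)                       # predicted P(a > b)
    grad = (X_a - X_b).T @ (p - prefs) / len(prefs)    # cross-entropy gradient
    w -= lr * grad

# The learned reward should rank the training pairs mostly like the annotators.
acc = np.mean(((X_a - X_b) @ w > 0) == prefs.astype(bool))
print(f"pairwise agreement: {acc:.2f}")
```

Note that even when the fit succeeds on the training pairs, nothing in this procedure constrains behavior off-distribution, which is the distributional-shift worry above.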
In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.
This post by the same author answers your comment: https://carado.moe/surprise-you-want.html
Freedom is just a heuristic; let’s call the actual thing we want for humans our values (which is what we hope Elua will return in this scenario). By definition, our values are everything we want, including possibly the abolition of anthropocentrism.
What is meant here by freedom and utopia is “the best scenario”. It’s not about what our values are, it’s about a method proposed to reach them.
AlexNet dates back to 2012; I don’t think previous work on AI can be compared to modern statistical AI.
Paul Christiano’s foundational paper on RLHF dates back to 2017.
Arguably, all agent foundations work has turned out to be useless so far, so prosaic alignment work may be what Roko is taking as the beginning of AIS as a field.
That sounds nice but is it true? Like, that’s not an argument, and it’s not obvious! I’m flabbergasted it received so many upvotes.
Can someone please explain?
[Question] Do you like excessive sugar?
It’s not obvious at all to me, but it’s certainly a plausible theory worth testing!
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I’m not getting from this paper.
Or maybe it’s not a missing insight, but I don’t understand why this kind of work is interesting/important?
I felt like I had a pretty good grasp on what was happening, but in the end I’m just as confused as at the beginning… ’^-^
The presentation starts at 3:50.
It’s not so easy, but this is the perspective of Luna. I for one really enjoy how the information (especially dialogue) is dumbed down to what she perceives.
How they ambushed Harry is not relevant to what she thinks.
I don’t think the two closed answers to “Have you stopped beating your wife?” have such a well-defined meaning. Since this is natural language, and I understand a no as meaning “I’m still beating her,” and I expect most people to interpret a no the same way as I do, it’s far from obvious why this interpretation is incorrect. (Since that sentence is typically used as an example that has no good answer, use “Will you stop smoking soon?”, which is less standard, for the sake of the argument.)
A lot of the argumentation in this post is plausible, but also, like, not very compelling?
Mostly the “frictionless” model of sexual/gender norms and the associated examples: I can see why these situations are plausible (if only because they’re very present in my local culture), but I wouldn’t be surprised if they were a bunch of social myths either, in which case the whole post is invalidated.
I appreciate the effort though; it’s food for thought even if it doesn’t tell me much about how to update based on the conclusion.