Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt implicitly encourages behavior (output length) one way or the other — am I correct in interpreting it that way? So e.g. have a much more subdued/neutral tone for the consciousness example?
Joe Kwon
[Question] Are there any groupchats for people working on Representation reading/control, activation steering type experiments?
Claude wants to be conscious
Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?
Super cool stuff. Minor question, what does “Fraction of MLP progress” mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!
FWIW I understand now what it’s meant to do, but have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AIs delivering positive outcomes would be helpful.
Is such a “channel” necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what a success looks like to you here, etc.
I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior, which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it’s hard to tell what you’re supposed to take away from them. They would probably be better received if the content were written such that it’s easy to understand what your message is at an object level as well as what the point of your post is.
I read the Snuggle/Date/Slap Protocol and feel confused about what you’re trying to accomplish (is it solving AI Alignment?) and how the method is supposed to accomplish that.
In the ethicophysics posts, I understand the object level claims/material (like the homework/discussion questions) but fail to understand what the goal is. It seems like you are jumping to grounded mathematical theories for stuff like ethics/morality which immediately makes me feel dubious. It’s a too much, too grand, too certain kind of reaction. Perhaps you’re just spitballing/brainstorming some ideas, but that’s not how it comes across and I infer you feel deeply assured that it’s correct given statements like “It [your theory of ethics modeled on the laws of physics] therefore forms an ideal foundation for solving the AI safety problem.”
I don’t necessarily think you should change whatever you’re doing, BTW — just pointing out some likely reactions/impressions driving the negative karma.
This is terrific. One feature that would be great to have is a way to sort and categorize your predictions under various labels.
[Linkpost] Faith and Fate: Limits of Transformers on Compositionality
The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge
Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position.
I don’t think we’d be in a good position even if we instilled an alignment drive this strong in AGI.
To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.
Hi Cameron, nice to see you here : ) what are your thoughts on a critique like: human prosocial behavior/values only look the way they look and hold stable within-lifetimes, insofar as we evolved in + live in a world where there are loads of other agents with roughly equal power as ourselves? Do you disagree with that belief?
This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!
This is a really cool idea and I’m glad you made the post! Here are a few comments/thoughts:
H1: “If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes”
How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren’t). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we’re worried about is in a different reference class. Not sure.
H4 is something I’m super interested in and would be happy to talk about it in conversations/calls if you want to : )
Paper: Forecasting world events with neural nets
Something at the root of this might be relevant to the inverse scaling competition, where they’re trying to find what things get worse in larger models. This might have some flavor of obvious wrongness → deception via plausible-sounding things as models get larger? https://github.com/inverse-scaling/prize
Interesting idea. Like... a mix of genuine sympathy/expansion of the moral circle to AI, and virtue signaling/an anti-corporation meme spreads to the majority population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations’ efforts to get to powerful AI unless it reaches a threshold, at which point there are very dramatic actions against corporations who continue to try to do that thing.
I stream-of-consciousness’d this out and I’m not happy with how it turned out, but it’s probably better I post this than delete it for not being polished and eloquent. Can clarify with responses in comments.
Glad you posted this and I’m also interested in hearing what others say. I’ve had these questions for myself in tiny bursts throughout the last few months.
When I get the chance to speak to people earlier in their career stage than myself (starting undergrad, or a high schooler attending a mathcamp I went to) who are undecided about their careers, I bring up my interest in AI Alignment and why I think it’s important, and share resources for them after the call in case they’re interested in learning more about it. I don’t have very many opportunities like this because I don’t actively seek to identify and “recruit” them. I only bring it up by happenstance (e.g. joining a random discord server for homotopy type theory, seeing an intro by someone who went to the same mathcamp as me and is interested in cogsci, and scheduling a call to talk about my research background in cogsci and how my interests have evolved/led me to alignment over time).
I know very talented people who are around my age at MIT and from a math program I attended; students who are breezing by technical double majors with perfect GPAs, IMO participants, good competitive programmers, etc. Some things that make it hard for me:
If I know them well, I can talk about my research interests and try to get them to see my motivation, but if I’m only catching up with them 1-2x a year, it feels very unnatural and synthetic for me to be spending that time trying to convert them into doing alignment work. Even if I am still very close to them / talk to them frequently, there’s still the issue of bringing it up naturally and having a chance to convince them. Most of these people are doing Math PhDs, or trading in finance, or working on a startup, or… The point is that they are fresh on their sprint down the path that they have chosen. They are all the type who are very focused and determined to succeed at the goals they have settled on. It is not “easy” to get them (or for that matter, almost any college student) to halt their “exploit” mode, take 10 steps back and lots of time from their busy lives, and then “explore” another option that I’m seemingly imposing onto them. FWIW, the people I know who are in trading seem to be the most likely to switch out (they’ve explicitly told me in conversations that they just enjoy the challenge of the work, but want to find more fulfilling things down the road), and to these people I share ideas and resources about AI Safety.
I shared resources after the call, talked about why I’m interested in alignment, and that’s the furthest I’ve gone wrt potentially converting someone who is already in a separate career track, to consider alignment.
If it were MUCH easier to convince people that AI alignment is worth thinking about in under an hour, and I could reach out to people to talk to me about this for an hour without looking like a nutjob and potentially damaging our relationship because it seems like I’m just trying to convert them into something else, AND the field of AI Alignment were more naturally compelling for them to join, I’d do much more of this outreach. On that last point, what I mean is: for one moment, let’s suspend the object-level importance of solving AI Alignment. In reality, there are things that are incredibly important/attractive for people when pursuing a career. Status, monetary compensation, and recognition (and not being labeled a nutjob) are some big ones. If these things were better (and I think they are getting much better recently), it would be easier to get people to spend more time at least thinking about the possibility of working on AI Alignment, and eventually some would work on it, because I don’t think the arguments for x-risk from AI are hard to understand. If I personally didn’t have so much support by way of programs the community had started (SERI, AISC, EA 1-1s, EAG AI Safety researchers making time to talk to me), or it felt like the EA/X-risk group was not at all “prestigious”, I don’t know how engaged I would’ve been in the beginning, when I started my own journey learning about all this. As much as I wish it weren’t true, I would not be surprised at all if the first instinctual thing that led me down this road was noticing that EAs/LW users were intelligent and had a solidly respectable community, before choosing to spend my time engaging with the content (a lot of which was about X-risks).
Makes sense, and I also don’t expect the results here to be surprising to most people.
What do you mean by this part? As in if it just writes very long responses naturally? There’s a significant change in the response lengths depending on whether it’s just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating some control prompts that have nothing to do with consciousness—in which case it had shorter responses after controlling for input length.
Basically, because I’m working with an already-RLHF’d model whose output lengths are probably mostly determined by whatever happened in the preference-tuning process, I try my best to account for that by having similar-length prompts preceding the questions I ask.
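The control described above could be sketched roughly like this: fit a baseline of response length vs. prompt length on the control prompts, then check whether responses to the consciousness prompts deviate from that baseline. This is only an illustration of the analysis, not the actual code or data; all numbers below are made-up placeholders.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (response length vs. prompt length)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def mean_residual(xs, ys, a, b):
    """Average deviation of observed response lengths from the fitted baseline."""
    return sum(y - (a + b * x) for x, y in zip(xs, ys)) / len(xs)

# Placeholder (prompt_length, response_length) pairs, in tokens.
control = [(50, 120), (100, 160), (150, 210), (200, 250)]
consciousness = [(60, 180), (110, 230), (160, 270)]

# Baseline from control prompts (unrelated to consciousness, varying lengths).
a, b = fit_line(*zip(*control))

# Positive delta = consciousness prompts elicit longer responses than the
# input-length baseline alone would predict.
delta = mean_residual(*zip(*consciousness), a, b)
```

With these placeholder numbers the consciousness prompts come out roughly 55 tokens longer than the length-matched baseline predicts, which is the kind of effect the control prompts are meant to isolate.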