Something at the root of this might be relevant to the inverse scaling competition thing where they’re trying to find what things get worse in larger models. This might have some flavor of obviously wrongness → deception via plausible sounding things as models get larger? https://github.com/inverse-scaling/prize
Joe Kwon
I’m also not sold on this specific part, and I’m really curious about what things support the idea. One reason I don’t think it’s good to rely on this as the default expectation though, is that I’m skeptical about humans’ abilities to even know what the “best experience” is in the first place. I wrote a short rambly post touching on, in some part, my worries about online addiction: https://www.lesswrong.com/posts/rZLKcPzpJvoxxFewL/converging-toward-a-million-worlds
Basically, I buy into the idea that there are two distinct value systems in humans. One subconscious system where the learning is mostly from evolutionary pressures, and one conscious/executive system that cares more about “higher-order values” which I unfortunately can’t really explicate. Examples of the former: craving sweets, addiction to online games with well-engineered artificial fulfillment. Example of the latter: wanting to work hard, even when it’s physically demanding or mentally stressful, to make some type of positive impact for broader society.
And I think today’s modern ML systems are asymmetrically exploiting the subconscious value system at the expense of the conscious/executive value system. Even knowing all this, I really struggle to overcome instances of akrasia, controlling my diet, not drowning myself in entertainment consumption, etc. I feel like there should be some kind of attempt to level the playing field, so to speak, with which value system is being allowed to thrive. At the very least, there should be transparency and knowledge about this phenomenon for people who are interacting with powerful recommender (or just general) ML systems, and in the optimal case, complete agency and control over which value system you want to prioritize, and to what extent.
This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!
This is a really cool idea and I’m glad you made the post! Here are a few comments/thoughts:
H1: “If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes” How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren’t). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we’re worried about is in a different reference class. Not sure.
H4 is something I’m super interested in and would be happy to talk about it in conversations/calls if you want to : )
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: conceptnet, less crude: similarity judgments on visual features and classes collected by humans) is a good enough proxy for “relevant structures” (or at least that these representations more faithfully capture the natural abstractions than the best machines in vision tasks for example, where human performance is the benchmark performance), right?
I had a similar idea about ontology mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize that idea. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines are representing things differently (and that humans, for now, are doing it “better”/more efficiently/more like the natural abstraction structures that exist).
I think I am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.
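To make the "find where humans and machines represent things differently" idea a bit more concrete, here is a minimal sketch, not an actual experiment. It assumes human similarity judgments are already scaled to the same range as cosine similarity and keyed by alphabetically ordered item pairs; all function names and the data layout are hypothetical choices of mine.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered item pair, keyed (a, b) with a < b."""
    names = sorted(embeddings)
    return {
        (a, b): cosine(embeddings[a], embeddings[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def pearson(xs, ys):
    """Pearson correlation; raises if either list has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def representation_gap(human_sim, model_embeddings):
    """Return (alignment score, item pairs sorted by largest human/model disagreement).

    Assumption: human_sim maps (a, b) pairs with a < b to scores on the
    same scale as cosine similarity.
    """
    model_sim = pairwise_similarities(model_embeddings)
    pairs = sorted(human_sim.keys() & model_sim.keys())
    human_scores = [human_sim[p] for p in pairs]
    model_scores = [model_sim[p] for p in pairs]
    alignment = pearson(human_scores, model_scores)
    disagreements = sorted(
        pairs, key=lambda p: abs(human_sim[p] - model_sim[p]), reverse=True
    )
    return alignment, disagreements
```

The pairs at the front of the disagreement list would be the candidates to probe with interpretability or adversarial examples. A rank-based comparison would probably be more robust than raw absolute differences, but this keeps the sketch short.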
Enjoyed reading this! Really glad you’re getting good research experience and I’m stoked about the strides you’re making towards developing research skills since our call (feels like ages ago)! I’ve been doing a lot of what you describe as “directed research” myself lately as I’m learning more about DL-specific projects and I’ve been learning much faster than when I was just doing cursory, half-assed paper skimming, alongside my cogsci projects. Would love to catch up over a call sometime to talk about stuff we’re working on now.
Really appreciated this post and I’m especially excited for post 13 now! In the past month or two, I’ve been thinking about stuff like “I crave chocolate” and “I should abstain from eating chocolate” as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is… idk vaguely “higher order” stuff where you will endure higher states of cortisol to contribute to society or something).
I’m starting to lean away from this a little bit, and I think reading this post gave me a good idea of what your thoughts are, but it’d be really nice to get confirmation (and maybe clarification). Let me know if I should just wait for post 13. My prediction is that you believe there is a single (not dual) generator of human values, which are essentially moderated at the neurochemical level, like “level of dopamine/serotonin/cortisol”. And yet, this same generator, due to our sufficiently complex “thought generator”, can produce plans and thoughts such as “I should abstain from eating chocolate” even though it would be a dopamine hit in the short-term, because it can simulate forward much further down the timeline, and believes that the overall neurochemical feedback will be better than caving into eating chocolate, on a longer time horizon. Is this correct?
If so, do you believe that because social/multi-agent navigation was essential to human evolution, the policy was heavily shaped by social world related pressures, which means that even when you abstain from the chocolate, or endure pain and suffering for a “heroic” act, in the end, this can all still be attributed to the same system/generator that also sometimes has you eat sugary but unhealthy foods?
Given my angle on attempting to contribute to AI Alignment is doing stuff to better elucidate what “human values” even is, I feel like I should try to resolve the competing ideas I’ve absorbed from LessWrong: 2 distinct value systems vs. singular generator of values. This post was a big step for me in understanding how the latter idea can be coherent with the apparent contradictions between hedonistic and higher-level values.
Thanks for posting this! I was wondering if you might share more about your “isolation-induced unusual internal information cascades” hypothesis/musings! Really interested in how you think this might relate to low-chance occurrences of breakthroughs/productivity.
My original idea (and great points against the intuition by Rohin)
“To me, it feels viscerally like I have the whole argument in mind, but when I look closely, it’s obviously not the case. I’m just boldly going on and putting faith in my memory system to provide the next pieces when I need them. And usually it works out.”
This closely relates to the kind of experience that makes me think about language as post hoc symbolic logic fitting to the neural computations of the brain. Which kinda inspired the hypothesis of a language model trained on a distinct neural net being similar to how humans experience consciousness (and gives the illusion of free will).
So, I thought it would be a neat proof of concept if GPT3 served as a bridge between something like a chess engine’s actions and verbal/semantic level explanations of its goals (so that the actions are interpretable by humans). e.g. bishop to g5; this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this).
In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf
which kinda shows it’s possible to have agent state/action representations in natural language for Frogger. There are probably glaring/obvious flaws with my OP, but this was what inspired those thoughts. Apologies if this is really ridiculous; I’m maybe suggesting ML-related ideas prematurely & having fanciful thoughts. Will be studying ML diligently to help with that.
For the basic features, I got used to navigating everything within an hour. I’ll be on the lookout for improvements to Roam or other note-taking programs like this.
This is terrific. One feature that will be great to have, is a way to sort and categorize your predictions under various labels.
interesting idea. like.. a mix of genuine sympathy/expansion of moral circle to AI, and virtue signaling/anti-corporation meme spreads to the majority population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations’ efforts to get to powerful AI unless it reaches a threshold, at which point there are very dramatic actions against corporations who continue to try to do that thing.
Sorry if it’s obvious from some other part of your post, but the whole premise is that sufficiently strong models *deployed in sufficiently complex environments* lead to general intelligence with optimization over various levels of abstraction. So why is it obvious that: It doesn’t matter if your AI is only taught math, if it’s a glorified calculator; any sufficiently powerful calculator desperately wants to be an optimizer?
If it’s only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI’s compute/power lead to it becoming an optimizer over a wider domain than just arithmetic? Maybe I’m misunderstanding the claim, or maybe there’s an obvious reason I’m overlooking.
Also, what do you think of the possibility that when AI becomes superhuman++ in tasks, the representations go from interpretable to inscrutable again (because it uses lower-level representations that are inaccessible to humans)? I understand the natural abstraction hypothesis, and I buy it too, but even an epsilon increase in details might compound into significant prediction outcomes if a causal model is trying to use tons of representations in conjunction to compute something complex.
Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)?
Makes sense, and I also don’t expect the results here to be surprising to most people.
Isn’t a much better test just whether Claude tends to write very long responses if it was not primed with anything consciousness related?
What do you mean by this part? As in if it just writes very long responses naturally? There’s a significant change in the response lengths depending on whether it’s just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating some control prompts that have nothing to do with consciousness—in which case it had shorter responses after controlling for input length.
Basically because I’m working with an already RLHF’d model whose output lengths are probably most dominated by whatever happened in the preference tuning process, I try my best to account for that by having similar length prompts preceding the questions I ask.
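A minimal sketch of the kind of length-matched comparison I mean, not my actual analysis code. The field names (`condition`, `prompt_tokens`, `response_tokens`), the condition labels, and the bucket size are all hypothetical, and the sketch contains no data from the experiment.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_condition(records, bucket_size=50):
    """Average response length per (prompt-length bucket, condition).

    Each record is a dict with hypothetical keys 'condition',
    'prompt_tokens', and 'response_tokens'. Bucketing by prompt length
    means conditions are only compared at similar input lengths.
    """
    grouped = defaultdict(list)
    for record in records:
        bucket = record["prompt_tokens"] // bucket_size
        grouped[(bucket, record["condition"])].append(record["response_tokens"])
    return {key: mean(values) for key, values in grouped.items()}


def length_matched_gaps(records, treatment="consciousness",
                        control="control", bucket_size=50):
    """Per-bucket difference: mean treatment length minus mean control length.

    Buckets that lack a length-matched control are skipped, since there
    is nothing fair to compare them against.
    """
    means = mean_length_by_condition(records, bucket_size)
    gaps = {}
    for (bucket, condition), avg in means.items():
        if condition == treatment and (bucket, control) in means:
            gaps[bucket] = avg - means[(bucket, control)]
    return gaps
```

A positive gap in a bucket would mean the consciousness prompts got longer responses than control prompts of similar length, and a negative gap the opposite. The bare-question condition would be its own bucket (no preceding prompt), so it would not get mixed in with the primed conditions.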
Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt is implicitly encouraging behavior (output length) one way or the other. Am I interpreting that correctly? So e.g. have a much more subdued/neutral tone for the consciousness example?
Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?
Super cool stuff. Minor question, what does “Fraction of MLP progress” mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!
Hi Cameron, nice to see you here : ) what are your thoughts on a critique like: human prosocial behavior/values only look the way they look and hold stable within-lifetimes, insofar as we evolved in + live in a world where there are loads of other agents with roughly equal power as ourselves? Do you disagree with that belief?