Something at the root of this might be relevant to the Inverse Scaling Prize, where they’re trying to find tasks that get worse in larger models. This might have some flavor of obvious wrongness → deception via plausible-sounding outputs as models get larger? https://github.com/inverse-scaling/prize
Joe Kwon
[Linkpost] Faith and Fate: Limits of Transformers on Compositionality
Converging toward a Million Worlds
I’m also not sold on this specific part, and I’m really curious what supports the idea. One reason I don’t think it’s good to rely on this as the default expectation, though, is that I’m skeptical of humans’ ability to even know what the “best experience” is in the first place. I wrote a short, rambly post touching, in some part, on my worries about online addiction: https://www.lesswrong.com/posts/rZLKcPzpJvoxxFewL/converging-toward-a-million-worlds
Basically, I buy into the idea that there are two distinct value systems in humans: one subconscious system whose learning comes mostly from evolutionary pressures, and one conscious/executive system that cares more about “higher-order values,” which I unfortunately can’t really explicate. Examples of the former: craving sweets, addiction to online games with well-engineered artificial fulfillment. Example of the latter: wanting to work hard, even when it’s physically demanding or mentally stressful, to make some type of positive impact for broader society.
And I think today’s ML systems are asymmetrically exploiting the subconscious value system at the expense of the conscious/executive value system. Even knowing all this, I really struggle to overcome instances of akrasia, control my diet, not drown myself in entertainment consumption, etc. I feel like there should be some kind of attempt to level the playing field, so to speak, in terms of which value system is allowed to thrive. At the very least, there should be transparency and knowledge about this phenomenon for people who are interacting with powerful recommender (or just general) ML systems, and optimally, complete agency and control over which value system you want to prioritize, and to what extent.
This was very insightful. It seems like a great thing to point to for the many people newish to alignment (like me) who are ideating research agendas. Thanks for writing and posting!
This is a really cool idea and I’m glad you made the post! Here are a few comments/thoughts:
H1: “If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes”

How confident are you in this premise? Power and one’s sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren’t). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we’re worried about is in a different reference class. Not sure.
H4 is something I’m super interested in and would be happy to talk about it in conversations/calls if you want to : )
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: ConceptNet; less crude: similarity judgments on visual features and classes collected from humans) is a good enough stand-in for the “relevant structures” (or at least that these representations capture the natural abstractions more faithfully than the best machines do in vision tasks, for example, where human performance is the benchmark), right?
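To make this concrete, here is a minimal sketch (my own illustration, not anything from the original exchange) of one way such an experiment could look: treat human similarity judgments as the proxy for the relevant structures and ask how well a model’s embedding geometry lines up with them via a simple RSA-style correlation. All data and names below are hypothetical placeholders.

```python
# Minimal, hypothetical sketch: compare a model's representational geometry to
# human similarity judgments with a Spearman correlation over pairwise similarities.
import numpy as np
from scipy.stats import spearmanr

def similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of item embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def representational_alignment(model_embeddings: np.ndarray,
                               human_similarities: np.ndarray) -> float:
    """Spearman correlation between the model's pairwise similarities and human
    similarity judgments, using the upper triangle only (no diagonal)."""
    model_sim = similarity_matrix(model_embeddings)
    iu = np.triu_indices_from(human_similarities, k=1)
    rho, _ = spearmanr(model_sim[iu], human_similarities[iu])
    return rho

# Hypothetical usage: 50 visual classes, 128-dim model embeddings, and a 50x50
# matrix of human pairwise similarity ratings (random stand-ins here).
rng = np.random.default_rng(0)
model_embeddings = rng.normal(size=(50, 128))
human_similarities = rng.uniform(size=(50, 50))
human_similarities = (human_similarities + human_similarities.T) / 2
print(representational_alignment(model_embeddings, human_similarities))
```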
I had a similar idea about identifying ontology mismatches by checking for isomorphic structures, and also realized I had no idea how to execute it. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness, where we hunt for interesting properties given that we can identify the biggest ways humans and machines represent things differently (and that humans, for now, are doing it “better”: more efficiently, more like the natural abstraction structures that exist).
I think I am working in the same building this summer (I caught a split-second glance of you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.
Enjoyed reading this! Really glad you’re getting good research experience, and I’m stoked about the strides you’re making towards developing research skills since our call (feels like ages ago)! I’ve been doing a lot of what you describe as “directed research” myself lately as I learn more about DL-specific projects, and I’ve been learning much faster than when I was just doing cursory, half-assed paper skimming alongside my cogsci projects. Would love to catch up over a call sometime to talk about the stuff we’re working on now.
Really appreciated this post and I’m especially excited for post 13 now! In the past month or two, I’ve been thinking about stuff like “I crave chocolate” and “I should abstain from eating chocolate” as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is… idk vaguely “higher order” stuff where you will endure higher states of cortisol to contribute to society or something).
I’m starting to lean away from this a little bit, and I think reading this post gave me a good idea of what your thoughts are, but it’d be really nice to get confirmation (and maybe clarification). Let me know if I should just wait for post 13. My prediction is that you believe there is a single (not dual) generator of human values, which is essentially moderated at the neurochemical level, like “level of dopamine/serotonin/cortisol”. And yet this same generator, thanks to our sufficiently complex “thought generator”, can produce plans and thoughts such as “I should abstain from eating chocolate” even though eating it would be a dopamine hit in the short term, because it can simulate much further forward down the timeline and believes the overall neurochemical feedback will be better, on a longer horizon, than caving in and eating the chocolate. Is this correct?
If so, do you believe that because social/multi-agent navigation was essential to human evolution, the policy was heavily shaped by social-world pressures, which means that even when you abstain from the chocolate, or endure pain and suffering for a “heroic” act, in the end this can all still be attributed to the same system/generator that also sometimes has you eat sugary but unhealthy foods?
Given that my angle on contributing to AI alignment is doing work to better elucidate what “human values” even are, I feel like I should try to resolve the competing ideas I’ve absorbed from LessWrong: two distinct value systems vs. a singular generator of values. This post was a big step for me in understanding how the latter idea can be coherent with the apparent contradictions between hedonistic and higher-level values.
Thanks for posting this! I was wondering if you might share more about your “isolation-induced unusual internal information cascades” hypothesis/musings! Really interested in how you think this might relate to low-chance occurrences of breakthroughs/productivity.
My original idea (and great points against the intuition by Rohin)
“To me, it feels viscerally like I have the whole argument in mind, but when I look closely, it’s obviously not the case. I’m just boldly going on and putting faith in my memory system to provide the next pieces when I need them. And usually it works out.”
This closely relates to the kind of experience that makes me think about language as post hoc symbolic logic fitted to the neural computations of the brain, which kinda inspired the hypothesis that a language model trained on a distinct neural net would be similar to how humans experience consciousness (and gives rise to the illusion of free will).
So, I thought it would be a neat proof of concept if GPT-3 served as a bridge between something like a chess engine’s actions and verbal/semantic-level explanations of its goals, so that the actions are interpretable by humans. E.g., bishop to g5: this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this; a rough code sketch of the idea appears below, after this comment).
In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf
which kinda shows it’s possible to have agent state/action representations in natural language for Frogger. There are probably glaring/obvious flaws with my OP, but this was what inspired those thoughts. Apologies if this is really ridiculous; I’m maybe suggesting ML-related ideas prematurely and having fanciful thoughts. I will be studying ML diligently to help with that.
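For what it’s worth, a minimal sketch of the move-explanation idea could look like the following. This is my own illustration, not the approach from the linked paper: it assumes the python-chess library, the language-model call is a hypothetical stub (query_language_model), and a real pipeline would take the move from an actual chess engine rather than hard-coding it.

```python
# Hypothetical sketch: turn an engine's chosen move into a prompt asking a
# language model for a human-interpretable explanation of the move's goals.
import chess

def explanation_prompt(board: chess.Board, move: chess.Move) -> str:
    """Build a prompt asking a language model to explain an engine's move
    in terms of plans/goals a human can follow."""
    san = board.san(move)  # human-readable notation for the move
    return (
        f"Position (FEN): {board.fen()}\n"
        f"The engine plays {san}.\n"
        "Explain in one or two sentences what this move accomplishes "
        "(development, pins, pressure on specific squares) so that a club "
        "player can follow the engine's plan."
    )

def query_language_model(prompt: str) -> str:
    # Placeholder: call whatever completion API you actually use here.
    raise NotImplementedError

# Hypothetical usage: a hard-coded position and move standing in for engine output.
board = chess.Board()
for san in ["e4", "e5", "Nf3", "Nc6"]:
    board.push_san(san)
candidate = board.parse_san("Bb5")  # pretend this came from the engine
print(explanation_prompt(board, candidate))
```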
For the basic features, I got used to navigating everything within an hour. I’ll be on the lookout for improvements to Roam or other note-taking programs like this.
The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge
[Question] Partial-Consciousness as semantic/symbolic representational language model trained on NN
Claude wants to be conscious
[Question] Value of building an online “knowledge web”
This is terrific. One feature that would be great to have is a way to sort and categorize your predictions under various labels.
Hi Cameron, nice to see you here : ) What are your thoughts on a critique like: human prosocial behavior/values only look the way they do, and hold stable within lifetimes, insofar as we evolved in (and live in) a world where there are loads of other agents with roughly equal power to ourselves? Do you disagree with that belief?