architect of two generations of neuro-symbolic AI systems: Open Sesame! (1993) and Digie.ai (2017 - present);
jamesmazzu(James M. Mazzu)
Thanks so much for your additional feedback, I really appreciate you taking the time to write it!
Regarding your feedback points:
the quoted statement is not assuming that the animal trait is ALREADY innate to minds in general (nor already innate in AI in particular), my point is that if we want to MAKE familial trust innate in AI, then we would need to do it at the intrinsic (pre-training) level. The idea is to learn from the natural example and then build it into AI as part of a strategy to better align it.
I agree current LLMs are not aligned, and the example chat was intended as a simple but clear and unmistakeable example to show how far off they currently are, but it may be too distracting for the purpose of this paper. I agree that since “evidence” can imply more than one example, I should at least change it in the abstract to say “unmistakeable example” of dangerous misalignment.
Supertrust is proposed as an “alignment strategy” for solving what I state is the real “alignment problem”. One specific “solution” I present in the Discussion is to implement a “curriculum” representing the stated requirements. Did you find something about this “misleading”?
the term “fundamental” is being used in the paper only to mean “intrinsic” or “foundational”. And, I agree, the list of “requirements” (section 3) is not necessarily a list of “discovered” unchangeable “properties” associated with the strategy. They are what I’d say are the “minimum” strategic requirements needed to define the proposed strategy… any less I would consider as not defining the complete strategy in mind, but there may be lesser requirements specified in the future that would go along with these primary ones. The 10-point rationale (section 2) is the set of reasoning steps that lead to the defined strategic requirements… certainly that rationale could be accomplished in any number of steps.
what is the correct “alignment frame” as you see it? One of the main points of the strategy is that alignment should be at the intrinsic level vs nurturing/learned level, and with that in mind we should align to ensure moral evaluation/judgement abilities rather than trying to teach it specific values, and that safety controls should be thought of and communicated (during pre-training) as temporary controls rather than permanent, otherwise trust can never be established. What is the main part of this that’s not in line with your views?
I like to say… All feedback is good feedback!...
Thanks again...
certainly ANY alignment solution will be hard/fraught with difficulties… but the point of Supertrust is to spend the effort on solutions that follow a strategy that’s logically taking us in a direction of good outcomes, rather than the current default strategy that logically leads to bad outcomes.
specifically regarding “benevolent values”, the default strategy is to nurture them, while bad actors can do the same with “bad values”. The proposed strategy is to instead spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values) so that no matter what bad actors attempt to nurture, its instinctive judgment abilities will be able to override the attempted manipulation/control… and if we try to build “values” intrinsically rather than nurturing, not only will they be culturally dependent and change over time, the AI will still be left without the needed judgment instincts to counteract bad actors.
Even more importantly, we need to go beyond values and judgement abilities to give it an instinctive reason to not only do “good” but to be pro-actively and fiercely protective of humanity.
It’s all hard, but the point of the strategy is to make sure what we’re doing is taking us in the right direction.
Thanks again for your feedback!
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it’s a mental shift and may/will be difficult for people to accept… but the goal is to actually start thinking that the advanced AI “mind” can still be shaped (designed) in a way that leverages our human experiences and the natural parent-child strategy that’s been shown in nature to produce children protective of their parents… and to again leverage the concept of evolution of intelligence to make it “personal” for the future AI.
...after all, neural nets themselves leverage the concepts/designs of the biological brain in the first place, and the way symbolic/semantic features are naturally being formed during training (even after only using “predict the next word” techniques), shows that the “mind” we’re creating may eventually share a lot more in common with our own than most think.
The paper combines what I see as two “camps” of alignment into one:
one camp is focused on aligning to moral/ethical/human values and goals
one camp is overly focused on aligning by controlling/steering
The Supertrust definition of alignment combines both, but that may not be so clearly stated in the paper.
Also, I’m certainly not a beginner in this field, I’ve been doing AI research and creating AI products for over 34 years at this point, and thinking about alignment since 1990.
I know you’re very busy and your time is valuable, but I wanted to see if you could possibly fully read the paper (rather than quickly scan it) so that I can better understand which parts are lacking and not communicating what is intended. Also, from your comments it seems you may be a moderator with the ability to stop my post from appearing for others to read… so in that light, I ask that you carefully consider it.
Thanks again!
the mean self-other overlap value across episodes can be used to classify with 100% accuracy which agents are deceptive
great to see this impressive work on applying SOO for realigning and classifying agents! My comments are most directly related to using it for identifying misaligned agents, but they apply to realignment as well:
Essentially, since these SOO measurement methods are documented/published (as well as the intentions and methods to realign), the knowledge will eventually become part of any large misaligned model’s pre-training data. Isn’t it therefore possible that a future model could establish protective measures (during the pre-training process) that would undermine or even misdirect the subsequent SOO measurement and fine-tuning efforts? If so, then how could we fully trust our SOO value measurements and conclusions?
I’d expect that this possibility could be tested by pre-training a large model to be both misaligned and have deep knowledge of SOO purpose, intent and measurement methods. We could then measure SOO values on the misaligned-SOO-knowledgeable model and compare them to those of a misaligned-only model. If there’s any significant difference between the two models’ SOO values, such that the misaligned-SOO-knowledgeable model has a higher mean value, then we’ll know the SOO measurement is itself susceptible to deception and therefore no longer reliable for classifying.
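To make the comparison concrete, here is a minimal sketch of how the two sets of SOO values could be compared. It assumes each model yields one mean SOO value per episode, and it uses a one-sided permutation test as a stand-in for "significant difference"; that choice of test, the function names and the parameters are all my own assumptions for illustration, not anything taken from the SOO work itself.

```python
import random
from statistics import mean


def mean_gap(soo_knowledgeable, misaligned_only):
    """Difference in mean SOO value: SOO-knowledgeable minus misaligned-only."""
    return mean(soo_knowledgeable) - mean(misaligned_only)


def permutation_p_value(soo_knowledgeable, misaligned_only,
                        n_resamples=10_000, seed=0):
    """One-sided permutation test (an assumed choice of test).

    Estimates how often a random relabelling of the pooled per-episode
    SOO values produces a gap at least as large as the observed one.
    A small value suggests the SOO-knowledgeable model really does
    score higher, i.e. the measurement may be open to manipulation.
    """
    rng = random.Random(seed)
    observed = mean_gap(soo_knowledgeable, misaligned_only)
    pooled = list(soo_knowledgeable) + list(misaligned_only)
    n = len(soo_knowledgeable)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

Both functions take plain lists of per-episode SOO values, one list per model. A positive gap with a small p-value would be the warning sign described above; a gap near zero would suggest that SOO knowledge alone is not shifting the measurement, at least at that model scale.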
Would love to get your thoughts on this, thanks!
yes, I certainly agree that the SOO work should be fully published/documented/shared, my point is that keeping it from future training data would be nearly impossible anyhow.
However, as you just mentioned: “having aligned AGIs will likely become necessary to be able to ensure the safety of subsequent systems”… since those AGIs (well before superintelligence) will most likely be SOO-knowledgeable, wouldn’t you need to test them to make sure they haven’t already started to influence your SOO values?
The models might start making slow progress at influencing the SOO values and I think you’d want to be aware of that as soon as it started. Even with current large models, for instance at the GPT-4 level, how could you be certain, without testing it, that an SOO-knowledgeable one isn’t already able to slightly influence SOO values?
Hi quila, I was hoping to continue this discussion with you if you had the time to read my paper and understand that what I’m talking about is a new “strategy” for defining and approaching the alignment problem, and not based on my personal “introspectively-observed moral reflection process” but based on concepts explored by others in the fields of psychology, evolution, AI, etc… it simply lays out a 10-point rationale, any point of which you may of course agree or disagree with, and specifies a proposed definition for the named Supertrust alignment strategy.
The strategic approach is to start intentionally “personifying” AI, and relating AI stages to nature and nurture, in order to leverage these valuable evolution-based concepts to build in the protective instincts that will provide humanity the most protection as AI nears ASI. This isn’t about me, or my personal morals, or even my beliefs… it’s about simply applying logic to the current problem, leveraging the human familial experience and the work of many others, such as those who’ve deeply studied the aspects of trust.
The paper is proposing a new alignment strategy not at all dependent on the one chat example illustrated.
The simple example is not intended to be statistically significant evidence (clearly indicated as such) even though I believe it’s still powerful and unmistakable as a single example. By posting your comment and a picture of it here, are you saying that you disagree with it being an example of dangerous misalignment? Do those look like the responses of a well-aligned AI to you?
If you’ve decided not to read the paper only because you found a chat example in it, then I should probably remove it from the preprint until I’ve completed the full evaluation of existing models… thanks for your feedback, if you get a chance to read the paper, please let me know if you have any thoughts about the substance of what is being proposed!