Yeah, that model sounds plausible to me (pending elaboration on how the friend-or-enemy parameter is updated). Thanks.
Thanks for writing this series.
I can see how Approval Reward explains norm-following behavior. If people approve of honesty, then being honest will make people approve of me.
But I’m not totally convinced that Approval Reward is enough to explain norm-enforcement behavior on its own?
For some action or norm X, it doesn’t seem obvious to me that “doing X” and “punishing someone who does not-X” earn approval in the same way, unless you already know that humans punish others who do things they don’t like.
If you know that humans punish others who act contrary to norms that the humans value, then you can punish dishonest people to show that you value honesty, and then you’ll get an Approval Reward from other humans who value honesty.
But suppose that nobody already knew the pattern that humans punish others who act contrary to norms that the humans value. Then when you see someone being dishonest (acting contrary to honesty), you don’t know that “punishing this person for dishonesty will make others see that I value honesty”, so you wouldn’t expect to get an Approval Reward, and therefore you wouldn’t be motivated to punish them (if Approval Reward were your only motivation). And if everyone thinks the same way, then nobody will do any punishments for approval’s sake, and so you won’t see any examples from which to learn the pattern.
So it seems to me that although Approval Reward can take norm-enforcement behavior that already exists and “keep it going” for a while, it must have taken some other motivation to “get it started”. In the case of harmful norm violations, the enforcement could have been caused by Sympathy Reward plus means-end reasoning (as you mentioned in another context). But I think humans sometimes punish people for even harmless norm violations (e.g. fashion crimes), so either that was caused by misgeneralization from harmful violations, or there’s some third motivation involved.
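To make the bootstrapping worry concrete, here’s a toy simulation. This is entirely my own sketch, not something from the series; the decision rule, the 0.1 learning signal, and all the other numbers are illustrative assumptions. Agents punish a violator only when their learned estimate of the approval payoff for punishing is positive, and they update that estimate only by watching punishments happen:

```python
# Toy sketch of the bootstrapping worry: agents punish a norm violation only
# if they expect approval for doing so, and they learn that expectation solely
# by observing other agents punish. All numbers are illustrative assumptions.

import random

N_AGENTS = 50
N_ROUNDS = 200

def run(seed_punisher: bool) -> int:
    """Return the total number of punishments over the run. If seed_punisher
    is True, agent 0 punishes unconditionally (some non-approval motivation)."""
    # est[i] = agent i's learned estimate of the approval payoff for punishing
    # a norm violator. Starts at 0: nobody has seen punishment pay off yet.
    est = [0.0] * N_AGENTS
    total = 0
    for _ in range(N_ROUNDS):
        violator = random.randrange(N_AGENTS)
        punishers = []
        for i in range(N_AGENTS):
            if i == violator:
                continue
            intrinsic = seed_punisher and i == 0  # the "other motivation"
            if intrinsic or est[i] > 0:
                punishers.append(i)
        total += len(punishers)
        # Anyone who watched a punishment (and the approval it earned)
        # nudges their estimate upward.
        if punishers:
            for i in range(N_AGENTS):
                est[i] += 0.1  # crude learning signal; the value is arbitrary

    return total

random.seed(0)
print("no seed punisher:", run(seed_punisher=False))   # stays at 0
print("one seed punisher:", run(seed_punisher=True))   # punishment spreads
```

With no seeded punisher the estimate never leaves zero, so punishment never appears; with one intrinsically motivated punisher it spreads quickly. That’s the “keep it going but not get it started” asymmetry I mean.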
I’m not sure about this, though.
I can only see the first image you posted. It sounds like there should be a second image (below “This is not happening for his other chats:”) but I can’t see it.
In the graphic in section 3.5.2, you mention
Groups that are plausibly aggressive enough to unilaterally (and flagrantly-illegally) deliberately release a sovereign AI into the wild
What kind of existing laws were you thinking that this would violate? When you said “into the wild”, are you thinking of the AI copying itself across the internet (by taking over other people’s computers), which would violate laws about hacking? If the AI were just accessing websites like a human, or if the AI had a robot body and went out onto the streets, I can’t immediately think of any laws that would be violated.
Is the illegality dependent on the “sovereign” part? Is the illegality because of the actions the AI might need to take to prevent the creation of other AIs, and it’s a crime for the human group because they could foresee that the AI would do this?
James Miller discussed similar ideas.
The “ideas” link doesn’t seem to work.
About the example in section 6.1.3: Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech? It seems to me like that requires both (a) identifying that the speech is trying to get someone’s attention, and (b) identifying that the speech is directed at you. (Well, I guess (b) implies (a) if you weren’t visibly paying attention to her beforehand.)
About (a): If the Steering Subsystem doesn’t know the meaning of words, then how can it tell that Zoe is trying to get someone’s attention? Is there some way to tell from the sound of the voice? Or is it enough to know that there were no voices before and Zoe has just started talking now, so she’s probably trying to get someone’s attention to talk to them? (But that doesn’t cover all cases when Zoe would try to get someone’s attention.)
About (b): If you were facing Zoe, then you could tell if she was talking to you. If she said your name, then the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you? Are there any other ways the Steering Subsystem could tell if she was talking to you?
I’m not sure how many false positives vs. false negatives evolution will “accept” here, so I’m not sure how precise a check to expect.
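Just to pin down what I’m imagining, here’s a rough sketch of the kind of cheap cue-combination check I could picture the Steering Subsystem doing. This is entirely my own guess; the cue list, weights, and threshold are made up, and the real mechanism presumably isn’t anything this discrete:

```python
# Rough guess at a cheap "is Zoe trying to get my attention?" check built from
# cues that plausibly don't require understanding the words. The cues, weights,
# and threshold are invented for illustration only.

def attention_bid_score(
    voice_onset_after_silence: bool,  # speech just started after quiet (part of (a))
    rising_or_loud_prosody: bool,     # attention-getting tone of voice (part of (a))
    speaker_facing_me: bool,          # gaze/orientation cue for (b)
    heard_own_name: bool,             # name recognition, if available, for (b)
) -> float:
    score = 0.0
    score += 0.3 if voice_onset_after_silence else 0.0
    score += 0.2 if rising_or_loud_prosody else 0.0
    score += 0.3 if speaker_facing_me else 0.0
    score += 0.4 if heard_own_name else 0.0
    return score

# Where this threshold sits is exactly the false-positive vs. false-negative
# trade-off in question.
THRESHOLD = 0.5

# Example: Zoe starts talking after silence while facing you, no name used.
print(attention_bid_score(True, False, True, False) >= THRESHOLD)  # True
```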
I couldn’t click into it from the front page by clicking on the zone where the text content would normally go, but I was able to click into it by clicking on the reply-count icon in the top-right corner. (That wouldn’t have worked when there were zero replies, though.)
The UK government also heavily used AI chatbots to generate diagrams and citations for a report on the impact of AI on the labour market, some of which were hallucinated.
This link is broken.
Thank you for writing this series.
I have a couple of questions about conscious awareness, and a question about intuitive self-models in general. They might be out-of-scope for this series, though.
Questions 1 and 2 are just for my curiosity. Question 3 seems more important to me, but I can imagine that it might be a dangerous capabilities question, so I acknowledge you might not want to answer it for that reason.
In 2.4.2, you say that things can only get stored in episodic memory if they were in conscious awareness. People can sometimes remember events from their dreams. Does that mean that people have conscious awareness during (at least some of) their dreams?
Is there anything you can say about what unconsciousness is? That is, why is there nothing in conscious awareness during this state? Is the cortex not thinking any (coherent?) thoughts? (I have not studied unconsciousness.)
About the predictive learning algorithm in the human brain: what types of incoming data does it have access to? What types of incoming data is it building models to predict? I understand that it would be predicting data from your senses of vision, hearing, and touch, etc. But when it comes to building an intuitive self-model, does it also have data that directly represents what the brain algorithm is doing (at some level)? Or does it have to infer the brain algorithm from its effects on the external sense data (e.g. motor control changing what you’re looking at)?
In the case of conscious awareness, does the predictive algorithm receive “the thought currently active in the cortex” as an input to predict? Or does it have to infer the thought when trying to predict something else?
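To make sure I’m asking the question clearly, here are the two alternatives I have in mind, written out as toy training-data layouts. The framing and field names are my own inventions for illustration, not anything from the series:

```python
# Two toy layouts for what the predictive learner gets to see, just to make
# the question concrete. Field names are invented for illustration.

from dataclasses import dataclass
from typing import Any

@dataclass
class SenseData:
    vision: Any
    hearing: Any
    touch: Any

# Hypothesis A: the learner also receives some direct readout of the brain
# algorithm's own state (e.g. "the thought currently active in the cortex")
# as part of the data it is trained to predict.
@dataclass
class TrainingFrameA:
    senses: SenseData
    current_thought: Any  # direct (possibly compressed) readout of cortex state

# Hypothesis B: the learner only ever sees external sense data, and any model
# of "what the brain algorithm is doing" has to be inferred from its downstream
# effects (e.g. motor commands changing what you're looking at).
@dataclass
class TrainingFrameB:
    senses: SenseData
    # no direct readout; the self-model must be inferred from `senses` over time
```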
It sounds like inter-branch communication would imply that you could do that. OP does mention that as an application: