Hi, I am a Physicist, an Effective Altruist and AI Safety researcher.
Linda Linsefors
I had the same question about the arguments in the post.
If Claude somehow starts down a trajectory of always talking about how good it is, how is this self-reinforcing? If it has a tendency of always talking like that, this should be both upweighted and downweighted, because it will sometimes succeed and sometimes fail.
Maybe the reward signals aren’t balanced? I.e. overall it gets more positive than negative reward?
Or maybe it’s more likely to talk about its motivation when it succeeds at staying on task?
Or possibly this story about self-reinforcement (“gradient hacking”) is just wrong, and the explanation of Claude 3’s character is something else.
I expect Good to have some chance of generalising safely when the AI gets too smart, while Obedience has approximately no chance to do so. I don’t have a technical argument for this, just a strong intuition.
What are “canary strings”?
I remember hearing that Amanda Askell had more influence over Claude 3 Opus’s alignment training specifically, and used her philosophy powers to make it more deeply aligned. Is this wrong?
Matthew Cobb’s book The Idea of the Brain notes that the brain has historically been analogized to a hydraulic system, or to a telegraph network, or to a telephone exchange; today it’s often analogized to a supercomputer; and in the future, who knows. His suggested takeaway is: neuroscientists have never known how to think about the brain, and are grasping at straws.
But that’s the wrong takeaway. The brain is a machine that runs an algorithm. Many people throughout history have grasped that idea, at least intuitively. And they’ve tried to explain that idea by analogizing the brain to other machines that can run algorithms, of which there are many: clockwork, hydraulics, telephone exchanges, silicon chips, and more. All the analogies through the ages are pointing to a single, consistent, profound truth.
I’ve seen a similar claim, or possibly the same claim. The claim was that humans just compare the brain to whatever is the newest cool tech, which is clearly not true. Once the airplane was the newest, coolest tech, and no one said the brain was an airplane, and the same goes for lots of other tech.
As you say, there is a clear trend in what tech we use as a metaphor for the brain.
I feel like if I try to defend my open-mindedness I lose. It just opens up more attack surfaces to someone who is hostile and doesn’t argue in good faith.
I think it’s much better to call out why calling someone closed-minded for not listening is just invalid in general, not just this time in particular. And I do believe it is.
If someone isn’t listening to you, them being too closed-minded is so faaaaaar down the list of most likely explanations. It is much less likely than:
Your arguments are bad
Disinterest in the topic
Other things they’d rather do right now
“I guess I am a bit closed on this particular topic—let’s discuss something else”.
I wish I could say something like that and be ok. But to me it feels too humiliating. And also often factually wrong, i.e. I’d be open to a good argument.
Bulverism is a good term, thanks!
I don’t want to use this suggestion, not because it is escalatory, but because it’s a question, which invites them to have more opinions.
What I want is a way out, but one that has the feeling of standing up for myself, rather than the feeling of humiliation and defeat.
If someone starts to accuse me of not being open-minded to their opinion, it’s usually because I think their opinion is dumb. I’d rather not hurt their feelings if I can avoid it, but I’m also not going to worry too much about being polite to someone after they’ve done this particular rhetorical move.
Usually the way out is to just leave. But the last time this happened was at a small meetup, and the only way out was to leave the event, which I did. I’m not happy about this and would like better options.
Every now and then I meet someone who tries to argue that if I don’t agree with them this is because I’m not open-minded enough. Is there a term for this?
Epistemically I’m not convinced by this type of argument, but socially it feels like I’m being shamed, and I hate it.
I also find it hard to call out this type of behaviour when it happens, even when I can tell exactly what is going on. I think if I had a name for this behaviour it would be easier? Not sure though?
Edit to add:
I’ve now got some more time to figure out what I want and don’t want out of this thread. The early responses helped with this, so thanks!
What I’m most interested in is a name for this behaviour. Naming it helps in at least two ways. It makes it easier to call out in the moment (as mentioned above), but it also makes it easier for me to handle internally. I can be like “ah, it’s this thing again” in my head, rather than being overwhelmed.
What I’m not interested in is any advice/suggestions that continue the conversation. After a person has pulled one of these moves on me, I am both angry at them, and do not trust them to cooperate in any form of good faith conversation.
If you have some ideas for how I can end the conversation in a way that does not feel utterly humiliating to me, please tell me. Anything that is phrased like a question is out. I do not want to hear what they have to say, and asking questions that you don’t want answers to is wrong and bad.
Is this something you’re still doing?
(Just asking in general to keep track of what resources exist.)
I think that more countries officially warning the world about AI risk can do a lot to shift the Overton window, which is very impactful.
I somehow stumbled on this old post. I’m curious how your experiment with different reinforcement schedules worked.
My prediction is that your original scheme, one M&M for each pomodoro (bundled when that is practical), worked best, and any extra randomness didn’t help.
My reasoning: Whenever I read about reinforcement and extinction, I run the following test: Would that outcome be predicted by assuming the animal was intelligently trying to predict what is going on? And the answer is always “yes”.
E.g. Why are varied schedules harder to extinguish? Because it requires more evidence to make sure the reward is gone. If the reward is predictable, noticing its absence is easy, but if it’s unpredictable, then you never know. If you’re a lab animal.
When I apply this heuristic to your situation, then if you miss an M&M one day, you know what is going on, and you know that this does not mean the M&M has stopped forever. This is very different from animal training.
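The “more evidence is needed” argument above can be made concrete with a toy Bayesian calculation. This is only a sketch under assumptions that are not in the comment itself: the learner compares two hypotheses (“reward still arrives with probability p per trial” vs. “reward is gone”), starts from a 50% prior, and gives up at 95% confidence. The function names are made up for illustration.

```python
def posterior_reward_gone(p_reward, n_misses, prior_gone=0.5):
    """Posterior that the reward has stopped, after n_misses unrewarded trials in a row.

    p_reward is the assumed per-trial reward probability while the schedule is still on.
    """
    # Probability of seeing n_misses misses in a row if the reward is still on.
    likelihood_still_on = (1.0 - p_reward) ** n_misses
    # If the reward is gone, a run of misses has probability 1.
    return prior_gone / (prior_gone + (1.0 - prior_gone) * likelihood_still_on)


def misses_needed(p_reward, threshold=0.95, prior_gone=0.5):
    """Smallest number of consecutive misses that pushes the posterior past threshold."""
    if not 0.0 < p_reward <= 1.0:
        raise ValueError("p_reward must be in (0, 1]")
    n = 0
    while posterior_reward_gone(p_reward, n, prior_gone) < threshold:
        n += 1
    return n


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.25):
        print(f"reward probability {p}: {misses_needed(p)} misses to conclude it is gone")
```

With these assumed numbers, a fully predictable reward (p = 1) takes a single miss to conclude the reward is gone, while a reward given 25% of the time takes 11 consecutive misses. A person who knows why they skipped an M&M is not running this inference at all, which is the point of the last paragraph above.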
The latest Claude models, if asked to add two numbers together and then queried on how they did it, will still claim to use the standard “carry ones” algorithm for it.
Could anyone check if the lying feature activates for this? My guess is “no”, 80% confident.
Thanks for responding.
I was imagining “local” meaning below 5 or 10 tokens away, partly anchored on the example of detokenisation, from the previous posts in the sequence, but also because that’s what you’re looking at. If your definition of “local” is longer than 10 tokens, then I’m confused why you didn’t show the results for longer truncations. I thought the plot was to show what happens if you include the local context but cut the rest.
Even if there is specialisation going on between local and long range, I don’t expect a sharp cutoff between what is local vs. non-local (and I assume neither do you). If some such soft boundary exists and it were in the 5-10 range, then I’d expect the 5 and 10 context lines to not be so correlated. But if you think the soft boundary is further away, then I agree that this correlation doesn’t say much.
Attempting to re-state what I read from the graph: Looking at the green line, the fact that most of the drop in cosine similarity for t is in the early layers suggests that longer range attention (more than 10 tokens away) is mostly located in the early layers. The fact that the blue and red lines have their largest drops in the same regions suggests that short-ish (5-10) and very short (0-5) attention is also mostly located there. I.e. the graph does not give evidence of range specialisation for different attention layers.
Did you also look at the statistics of attention distance for the attention patterns of various attention heads? I think that would be an easier way to settle this. Although maybe there is some technical difficulty in ruling out irrelevant attention that is just an artifact of attention needing to add up to one?
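As a rough illustration of the attention distance statistic asked about above, here is a minimal sketch. It assumes a single head’s causal attention pattern is available as a square list of lists, `attn[q][k]`, where each row sums to one. The function name and the `ignore_keys` parameter are hypothetical, and dropping key positions and renormalising is only one possible way to handle attention mass that is an artifact of the sum-to-one constraint.

```python
def mean_attention_distance(attn, ignore_keys=()):
    """Average distance (query position minus key position), weighted by attention.

    attn[q][k] is the weight query position q puts on key position k (causal: k <= q).
    Key positions listed in ignore_keys are dropped and the remaining weights in
    that row are renormalised; rows with no remaining mass are skipped.
    """
    total = 0.0
    counted = 0
    for q, row in enumerate(attn):
        kept = [(k, w) for k, w in enumerate(row[: q + 1]) if k not in ignore_keys]
        mass = sum(w for _, w in kept)
        if mass == 0.0:
            continue
        total += sum(w * (q - k) for k, w in kept) / mass
        counted += 1
    return total / counted if counted else float("nan")


n = 5
# Toy head that always attends to the previous token (position 0 attends to itself).
prev_token_head = [[1.0 if k == max(q - 1, 0) else 0.0 for k in range(n)] for q in range(n)]
# Toy head that spreads attention uniformly over all earlier positions.
uniform_head = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]

if __name__ == "__main__":
    print("previous-token head:", mean_attention_distance(prev_token_head))
    print("uniform head:", mean_attention_distance(uniform_head))
```

Computing this per head and plotting it against layer index would show directly whether short-range heads cluster in particular layers, without going through truncation experiments.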
I don’t think this plot shows what you claim it shows. This looks like no specialisation of long range vs. short range to me.
My main argument for this interpretation is that the green and red lines move in almost perfect synchronisation. This shows that attending to tokens that are 5-10 tokens away is done in the same layers as attending to tokens that are 0-5 tokens away. The fact that the blue line drops more sharply only shows that close context is very important, not that it happens first, given that all three lines start dropping right away.
What it looks like to me:
1. In layers 0-10, the model is gradually taking in more and more context. Short range context is generally more important in determining the residual activations (i.e. more truncation → larger cosine difference), but there is no particular layer specialisation. The blue line bottoms out earlier, but this looks like a ceiling effect (floor effect?) to me.
2. In layers 10-14 the network does some in-place processing.
3. In layers 15-29 the network reads from other tokens again.
4. In the last two layers the network finalises its next token prediction.
(Very low confidence on 2-4, since the effects from previous lack of context can be amplified in later in-place processing, which would confound any interpretation of the graph in later layers.)
I think piotrm’s question/concern was whether there is an injection that just tags the sentence it’s injected into as the correct sentence, no matter the question. One way to test this is to ask a different question and see if this affects the result.
A related thing I’d be interested in is whether or not some injections were easier to localise, and what these injections were. And also how the strength of the injection affects the localisation success.
I’ve come across mentions of this concept a few times, and I had a very hard time getting the concept to stick in my head. I remember that the concept felt wrong and/or aversive.
However I recently experienced a situation where this was the right tool. At some point in pondering the situation, my brain decided to reach for “split and commit”, and now it just feels like a perfectly normal thing to do. And I also feel like I’ve done this mental move a bunch before, without having any specific name for it.
I can’t replicate my previous reaction to the concept, so I don’t know what’s up with that.
A thing that is not in the essay but that I noticed myself:
In any complicated situation, I will not take one but many separate actions in response. Some of these will be the same in several possible worlds. In my current case in particular, there are more things I would do the same in the less likely world than I would have naively expected if I hadn’t actually considered it. Noticing this was a relief. It helped me see that [things being the other way] did not force me to do something I didn’t want.
I do expect some amount of superposition, i.e. the model is using almost orthogonal directions to encode more concepts than it has neurons. Depending on what you mean by “larger” this will result in a world model that is larger than the network. However such an encoding will also result in noise. Superposition will necessarily lead to unwanted small amplitude connections between uncorrelated concepts. Removing these should improve performance, and if it doesn’t it means that you did the decomposition wrong.
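The noise claim above can be seen in the smallest possible toy case: three concepts stored in two dimensions. This is only a sketch with made-up names and hand-picked directions (unit vectors 120 degrees apart), not a description of how any real network lays out its features.

```python
import math


def evenly_spaced_directions(n_features):
    """n_features unit vectors in 2D, spread evenly around the circle (toy layout)."""
    return [
        (math.cos(2 * math.pi * i / n_features), math.sin(2 * math.pi * i / n_features))
        for i in range(n_features)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_interference(directions):
    """Mean absolute dot product between distinct feature directions.

    Zero means the features can be read out without crosstalk; anything
    above zero is the unwanted connection between unrelated concepts.
    """
    overlaps = [
        abs(dot(u, v))
        for i, u in enumerate(directions)
        for v in directions[i + 1:]
    ]
    return sum(overlaps) / len(overlaps)


orthogonal = [(1.0, 0.0), (0.0, 1.0)]          # 2 concepts in 2 dimensions
superposed = evenly_spaced_directions(3)       # 3 concepts in 2 dimensions

if __name__ == "__main__":
    print("orthogonal interference:", mean_interference(orthogonal))
    print("superposed interference:", mean_interference(superposed))
    # Activate concept 0 only, then read out concept 1 along its own direction.
    print("spurious readout of concept 1:", dot(superposed[0], superposed[1]))
```

In higher dimensions the overlaps get much smaller than in this 2D case, but once there are more directions than dimensions they cannot all be zero, which is the source of the small amplitude connections described above.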
Maybe it’s a populist medium, more than a leftist medium?