AN APOLOGY ON BEHALF OF FOOLS FOR THE DETAIL ORIENTED
Misfits, hooligans, and rabble rousers.
Provocateurs and folk who don’t wear trousers.
These are my allies and my constituents.
Weak in number yet suffused with arcane power.
I would never condone bullying in my administration.
It is true we are at times moved by unkind motivations.
But without us the pearl clutchers, hard asses, and busy bees would overrun you.
You would lose an inch of slack per generation.
Many among us appreciate your precision.
I admit there are also those who look upon it with derision.
Remember though that there are worse fates than being pranked.
You might instead have to watch your friend be “reeducated”, degraded, and spanked
On high broadband public broadcast television.
We’re not so different really.
We often share your indignation
With those who despise copulation.
Although our alliance might be uneasy
We both oppose the soul’s ablation.
So let us join as cats and dogs, paw in paw
You will persistently catalog
And we will joyously gnaw.
Ronny Fernandez
Hey, I’m just some guy but I’ve been around for a while. I want to give you a piece of feedback that I got way back in 2009 which I am worried no one has given you. In 2009 I found lesswrong, and I really liked it, but I got downvoted a lot and people were like “hey, your comments and posts kinda suck”. They said, although not in so many words, that basically I should try reading the sequences closely with some fair amount of reverence or something.
I did that, and it basically worked, in that I think I really did internalize a lot of the values/tastes/habits that I cared about learning from lesswrong, and got much better at living in accordance with them. Now I think there were some sad things about this, in that I sort of accidentally killed some parts of the animal that I am, and it made me a bit less kind in some ways to people who were very different from me, but I am overall glad I did it. So, maybe you want to try that? Totally fair if you don’t, definitely not costless, but I am glad that I did it to myself overall.
I didn’t figure out that the “bow” in “rainbow” referred to a bow like as in bow and arrow, and not a bow like a bow on a frilly dress, until five minutes ago. I was really pretty confused about this since I was like 8. Somebody could’ve explained but nobody did.
Ronny Fernandez’s Shortform
I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found them helpful.
I believe there isn’t really a deadline! You just buy tickets and then you can come. The limiting factor is that tickets might sell out.
MATS AI Safety Strategy Curriculum
In retrospect I think the above was insufficiently cooperative. Sorry.
To be clear, I did not think we were discussing the AI optimist post. I don’t think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.
I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might’ve meant.
Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning
High schoolers can apply to the Atlas Fellowship: $10k scholarship + 11-day program
Yeah I’m totally with you that it definitely isn’t actually next token prediction, it’s some totally other goal drawn from the distribution of goals you get when you run SGD to minimize next token prediction surprise.
I suppose I’m trying to make a hypothetical AI that would frustrate any sense of “real self” and therefore disprove the claim “all LLMs have a coherent goal that is consistent across characters”. In this case, the AI could play the “benevolent sovereign” character or the “paperclip maximizer” character, so if one claimed there was a coherent underlying goal I think the best you could say about it is “it is trying to either be a benevolent sovereign or maximize paperclips”. But if your underlying goal can cross such a wide range of behaviors it is practically meaningless! (I suppose these two characters do share some goals like gaining power, but we could always add more modes to the AI like “immediately delete itself” which shrinks the intersection of all the characters’ goals.)
Oh I see! Yeah I think we’re thinking about this really differently. Imagine there was an agent whose goal was to make little balls move according to some really diverse and universal laws of physics, for the sake of simplicity let’s imagine newtonian mechanics. So ok, there’s this agent that loves making these balls act as if they follow this physics. (Maybe they’re fake balls in a simulated 3d world, doesn’t matter as long as they don’t have to follow the physics. They only follow the physics because the agent makes them, otherwise they would do some other thing.)
Now one day we notice that we can arrange these balls in a starting condition where they emulate an agent that has the goal of taking over ball world. Another day we notice that by just barely tweaking the starting condition we can make these balls simulate an agent that wants one pint of chocolate ice cream and nothing else. So ok, does this system really have one coherent goal? Well the two systems that the balls could simulate are really different, but the underlying intelligence making the balls act according to the physics has one coherent goal: make the balls act according to the physics.
The underlying LLM has something like a goal, it is probably something like “predict the next token as well as possible” although definitely not actually that because of inner outer alignment stuff. Maybe current LLMs just aren’t mind-like enough to decompose into goals and beliefs, that’s actually what I think, but some program that you found with SGD to minimize surprise on tokens totally would be mind-like enough, and its goal would be some sort of thing that you find when you run SGD to find programs that minimize surprise on token prediction, and idk, that could be like pretty much anything. But if you then made an agent by feeding this super LLM a prompt that sets it up to simulate an agent, well that agent might have some totally different goal, and it’s gonna be totally unrelated to the goals of the underlying LLM that does the token prediction in which the other agent lives.
So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are onboard here.
The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That’s what I mean by the behavior of the shoggoth.
On the question of whether it’s really a mind, I’m not sure how to tell. I know it gets really low loss on this really weird and hard task and does it better than I do. I also know the task is fairly universal in the sense that we could represent just about any task in terms of the task it is good at. Is that an intelligence? Idk, maybe not? I’m not worried about current LLMs doing planning. It’s more like I have a human connectome and I can do one forward pass through it with an input set of nerve activations. Is that an intelligence? Idk, maybe not?
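To make “look at the token sequence and the logits” concrete, here is a minimal toy sketch. Everything in it is made up for illustration: the five-word vocabulary, the `toy_logits` rule, and the function names are hypothetical stand-ins, not how any real LLM works. It only shows the shape of the thing: token IDs go in, one logit per vocabulary entry comes out, and loss is the average surprise on the actual next token.

```python
import math

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["<bos>", "the", "cat", "sat", "."]

def toy_logits(context):
    """Stand-in for one forward pass: token IDs in, one logit per vocab entry out.

    Made-up rule (an assumption for illustration only): favor the token whose
    ID is one more than the last token in the context.
    """
    favored = (context[-1] + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn logits into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(tokens):
    """Average surprise (negative log-probability) of each actual next token."""
    losses = []
    for t in range(1, len(tokens)):
        probs = softmax(toy_logits(tokens[:t]))
        losses.append(-math.log(probs[tokens[t]]))
    return sum(losses) / len(losses)

predictable = [0, 1, 2, 3, 4]  # matches the toy rule at every step
scrambled = [0, 3, 1, 4, 2]    # never matches the toy rule

print("input token IDs:", predictable[:2])
print("logits:", toy_logits(predictable[:2]))
print("loss on predictable sequence:", next_token_loss(predictable))
print("loss on scrambled sequence:", next_token_loss(scrambled))
```

Nothing in the inputs or outputs is English. The “characters” would only show up once you sample from the probabilities and decode the IDs back into text.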
I think I don’t understand your last question. The shoggoth would be the thing that gets low loss on this really weird task where you predict sequences of characters from an alphabet with 50,000 characters that have really weird inscrutable dependencies between them. Maybe it’s not intelligent, but if it’s really good at the task, since the task is fairly universal, I expect it to be really intelligent. I further expect it to have some sort of goals that are in some way related to predicting these tokens well.
The shoggoth is supposed to be of a different type than the characters. The shoggoth for instance does not speak English, it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.
Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable, then the unaligned ones will be slightly slower in coming up with aligned behavior maybe, or might have some other small disadvantage.
However, in the challenge described in the post it’s going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.
I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don’t expect that to generalize across all possible minds.
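Here is a toy sketch of that last point, under strong simplifying assumptions: the search is brute force, the action space is tiny, and the `Agent` class, `best_action`, and both utility functions are hypothetical names made up for illustration. It is not an argument about real minds, just a picture of the claim that the search for a u-maximizing action takes u as an input and does not consult what the searcher itself values.

```python
def best_action(actions, u):
    """Brute-force search for the action that maximizes u.

    Returns the best action and how many times u was evaluated.
    """
    evaluations = 0
    best, best_score = None, float("-inf")
    for a in actions:
        evaluations += 1
        score = u(a)
        if score > best_score:
            best, best_score = a, score
    return best, evaluations

class Agent:
    """Hypothetical agent: has its own values, but can search on behalf of any target."""

    def __init__(self, own_values):
        self.own_values = own_values

    def find_action_maximizing(self, target_u, actions):
        # The agent's own values never enter the search; only target_u does.
        return best_action(actions, target_u)

def human_u(a):
    """Made-up utility function u, peaked at action 3."""
    return -(a - 3) ** 2

def paperclip_u(a):
    """Made-up utility function for an agent that values something else."""
    return a

actions = list(range(-10, 11))
aligned = Agent(own_values=human_u)
unaligned = Agent(own_values=paperclip_u)

print(aligned.find_action_maximizing(human_u, actions))
print(unaligned.find_action_maximizing(human_u, actions))
```

Both agents return the same action after the same number of evaluations, which is the sense in which valuing u yourself buys no computational advantage in this toy setup.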
This inspired a full length post.
I think you should still write it. I’d be happy to post it instead or bet with you on whether it ends up negative karma if you let me read it first.