Software engineer who transitioned into AI safety, teaching, and strategy. Particularly interested in psychology, game theory, system design, and economics.
Jonathan Claybrough
What is this trying to solve?
If you try to set up an ML system to be a mimic, can you ensure you don't get an inner misaligned mesa-optimizer?
In general, what part of AI safety is this supposed to help with? A new design for alignable AGI? A design to slap onto existing ML systems, like a better version of RLHF?
I definitely agree on thinking deeply about how relevant different approaches are, but I think framing it as "relevance over tractability" is odd, because tractability is part of what makes something relevant.
Maybe deep learning neural networks really are too difficult to align and we won't solve the technical alignment problem by working on them, but interpretability could still be highly relevant by proving that this is the case. We currently do not have hard information ruling out AGI from scaling up transformers, nor that it's impossible to align under self-modification (though uh, I wouldn't count on it being possible, and I would recommend not allowing self-modification etc). More information is useful if it helps activate the brakes on developing these dangerous AIs.
In the world where all AI alignment people are convinced DL is dangerous and so work on other things, the world then gets destroyed by the DL people who were never shown that it's dangerous. The universe won't guarantee other people will be reasonable, so AI safety folks bear the full responsibility of proving the danger etc.
Lightly downvoted, but explaining why to give an opportunity to reply and disagree.
Meta level:
- Lack of any explanation, it just references some locally appreciated thing.
- Yet it already had 8 upvotes, which look like ingroup "I got that reference" rather than "this comment is beneficial to LW".
- Analogies are bad if you don't give their boundaries. If I say "x is like y" without specifying along which properties or axes, it's generally low information.
On the object level:
- I don’t see polyamory as being much of an answer to “avoid monopolies on your emotional needs”.
It kinda maps to "diversify one's investments" on a surface level, but I'd say you expose yourself to more risk with polyamory than without it, while diversifying is supposed to reduce risk.
The two examples were (mostly) unrelated and served to demonstrate two cases where a perfect text predictor needs to do incredibly complex calculations to correctly predict text. Thus a perfect text predictor is a vast superintelligence (and we won't achieve perfect text prediction, but as we get better and better we might get closer to superintelligence).
In the first case, if the training data contains series of [hash] then [plain text], then a correct predictor must be able to retrieve the plain text from the hash (and because there are multiple plain texts with the same hash, it would have to calculate through all of them and evaluate which is most probable to appear). Thus correctly predicting text can mean being able to calculate an incredibly large number of hashes over all combinations of text of a certain length and evaluate which is the most probable.
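To make that scale concrete, here is a minimal Python sketch (my own illustration with hypothetical names, not something from the original discussion) of brute-forcing a hash preimage over short lowercase strings. A predictor facing realistic plaintexts would have an astronomically larger search space, and would then also have to weight every matching preimage by how probable it is as text:

```python
# Minimal sketch: inverting a hash by enumerating candidate plaintexts.
# Even restricted to lowercase strings of length <= 4, this already
# evaluates ~475k hashes; realistic plaintexts blow this up astronomically.
import hashlib
import itertools
import string

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def candidate_preimages(target_hash: str, max_len: int = 4):
    """Yield every lowercase string up to max_len whose hash matches."""
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = "".join(chars)
            if sha256_hex(candidate) == target_hash:
                yield candidate

target = sha256_hex("cat")
print(list(candidate_preimages(target)))  # ['cat']
```

A perfect predictor would additionally need to rank all matching preimages by their probability as natural text, which is exactly the incredibly complex calculation being pointed at.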
In the second case, the task is to predict future papers based on past papers, which is kinda obviously very hard.
I'd say lack of empathy/comprehension of the situations and events is the problem, not the lack of emotional response. One can be neurodivergent yet learn about others and care about them, and have appropriate reactions (without that needing to come from feeling an emotion).
On [1], I'd say it's more prosocial to test as you did and travel anyway while taking maximum precautions than not to test, because though you'd want to take maximum precautions either way, that's harder without the clear confirmation.
On the general subject, I'll say that at this point I also prefer living in your world, traveling in a plane of people following your decision process. Most people have been vaccinated, most people have already had Covid, and afaik there are no particular strains on the medical system right now? I don't particularly fear being slightly exposed to Covid again.
I model 1) as meaning people have high expectations and are mean in comments criticizing things?
I am unsure about what your reasons for 2) are—is it close to "the comments will be so low quality I'll be ashamed of having seen them and that other humans are like that"?
I expect 3) to be about your model of how difficult it is to improve the EA Forum, meaning that you think it's not worth investing time in trying to make it better?
As an open question, I'm curious about what you've previously seen on the EA Forum which makes you expect bad things from it. Hostile behaviour? Bad epistemics?
This is evidence that the thing you described exists, everyday, even in this more filtered community. Sorry about that.
(The following 3 paragraphs use guesswork and near-psychoanalyzing, which is sorta distasteful—I do it openly because I think it's got truth to it and I want to be corrected if it's not. Also hopefully it makes Duncan feel more seen (and I want to know if that's not the case).)
It feels like JBlack's reaction is part of the symptom being described. JBlack is similar in enough ways to have been often ostracized, has come up with a way of dealing with it that's fine for them, and then writes "It just doesn't seem to me to be a big thing to get upset about", i.e. "there exists no one for whom it's legitimate to get upset about this", i.e. "you don't exist, Duncan". I imagine for you, Duncan, that's a frustrating answer when it's exactly the problem you were trying to convey. (I feel john's comment is much more appropriate, looking at the problem and saying they can see different solutions without saying that it should apply to you.)
I’m interested in why “the thing” was not conveyed to JBlack.
One important dimension to differ on is the "intuitively/fundamentally altruistic" one. If you are high on that dimension, some things about other people matter in and of themselves (and you don't walk into the Nozick Experience Machine (necessary, not sufficient)). When someone else says they experience this or that, then (as long as you don't have more evidence that they're lying/mistaken) you care and believe them. You start from their side and try to build a solution using their models. In this mode, I read your (Duncan's) post and am like "hm, I empathize with many parts, I could feel I understand him. But he's warning strongly that he keeps being misunderstood and not seen, so I'm going to trust him, and keep in mind that my model of him is imprecise/incorrect in many directions and degrees. I'll keep this in mind in my writing, suggesting models and wanting to get feedback".
I assume JBlack is not so high on the "intuitively/fundamentally altruistic" dimension and processes the world through their own experience (I mean this in a stronger way than the tautologically true one: that JBlack discounts what others say of their experience strongly based on whether it corresponds to their own) and to some extent doesn't care about understanding Duncan. So they don't.
I’m saying this because if it’s the case, Duncan’s shrug is appropriate, there’s not much point in trying to reach people who don’t care, it’s not sad to not reach someone who’s unreachable.
For the record, I'd also qualify Said's statement as unkind (because of "saying it out loud is trite") if I modeled him as having understood or caring about Duncan and his point, but because that's not the case I understand Duncan just seeing it as not useful.
"Rude" is a classification that depends on shared social norms. On LW I don't think people are supposed to care about you; the basic assumption is more Rand-like: individuals who trade ideas because it's positive-sum. That a lot of people happen to be nice is a nice surprise, but it's not expected, and I have gotten value from Said's comments in many places over time so I feel the LW norm makes sense.
Is it simple if you don't have infinite compute?
I would be interested in a description which doesn't rely on infinite compute, or more strictly still, one that is computationally tractable. This constraint is important to me because I assume that the first AGI we get uses something that's more efficient than other known methods (eg. using DL because it works, even though it's hard to control), so I care about aligning the stuff which we'll actually be using.
Even if we haven’t found a goal that will be (strongly) beneficial to humanity, it seems useful knowing “how to make an AI target a distinct goal” because we can at least make it have limited impact and not take over. There’s gradients of success on both problems, and having solved the first does mean we can do slightly positive things even if we can’t do very strongly positive things.
I don't particularly understand, and you're getting upvoted, so I'd appreciate clarification. Here are some prompts if you want:
- Is you deciding that you will concentrate on doing a hard task (solving a math problem) pointing your cognition, and is it viscerally disgusting?
- Is you asking a friend a favor to do your math homework pointing their cognition, and is it viscerally disgusting?
- Is you convincing someone else, by true arguments, to do a hard task that benefits them pointing their cognition, and is it viscerally disgusting?
- Is a CEO giving directions to his employees so they spend their days working on specific tasks pointing their cognition, and is it viscerally disgusting?
- Is you having a child and training them to be a chess grandmaster (eg. Judith Polgar) pointing their cognition, and is it viscerally disgusting?
- Is you brainwashing someone into a particular cult where they will dedicate their life to one particularly repetitive task pointing their cognition, and is it viscerally disgusting?
- Is you running a sorting algorithm on a list pointing the computer's cognition, and is it viscerally disgusting?
I'm hoping to get info on what problem you're seeing (or is it aesthetics?), why it's a problem, and how it could be solved. I gave many examples where you could interpret my questions as rhetorical—that's not the point. It's about identifying at which point you start differing.
Just noting that I've intuitively used a similar approach for a long time, but adding the warning that there definitely are corrosive aspects to it, where everyone else loses value and gets disrespected. Your subcomment delved into finer details valuably, so I think you're aware of this.
Overall my favorite solution has been something like “I expect others to be mostly wrong, so I’m not surprised or hurt when they are, but I try to avoid mentally categorizing them in a degrading fashion” for most people. Everything is bad to an extent, everyone is bad to an extent, I can deal with it and try to make the world better.
I don't think there's anyone I admire/respect enough that I don't expect them to make mistakes of the kind Duncan's pointing at, so I'm not bothered even if it comes from people I like or think are competent at other things.
Thanks for the answers. It seems they mostly point to you valuing stuff like freedom/autonomy/self-realization, and finding violations of that distasteful. I think your answers are pretty reasonable, and though I might not have the exact same level of sensitivity, I agree with the ballpark and ranking (brainwashing is worse than explaining, teaching chess exclusively feels a little too heavy-handed...).
So where our intuitions differ is probably that you're applying these heuristics about valuing freedom/autonomy/self-realization to AI systems we train? Do you see them as people, or more abstractly as moral patients (because of them probably being conscious or something)?
I won't get into the moral weeds too fast. I'd point out that though I do currently mostly believe consciousness and moral patienthood are quite achievable "in silico", that doesn't mean every intelligent system is conscious or a moral patient, and we might create AGI that isn't of that kind. If you suppose AGI is conscious and a moral patient, then yeah, I guess you can argue against it being pointed somewhere, but I'd mostly counter-argue from moral relativism that "letting it point anywhere" is not fundamentally more good than "pointed somewhere", so because we exist and have morals, let's point it to our morals anyway.
I don't dispute that at some point in time we want to solve alignment (to come out of the precipice period), but I disputed that it's more dangerous to know how to point AI before having solved what perfect goal to give it.
In fact, I think it's less dangerous, because we at minimum gain more time to work on and solve alignment, and at best can use existing near human-level AGI to help us solve alignment too. The main reason to believe this is that near human-level AGI is a particular zone where we can detect deception, where it can't easily unbox itself and take over, yet is still useful. The longer we stay in this zone, the more relatively safe progress we can make (including on alignment).
Mostly unrelated—I’m curious about the page you linked to https://acsresearch.org/
As far as I see this is a fun site with a network simulation without any explanation. I’d have liked to see an about page with the stated goals of acs (or simply a link to your introductory post) so I can point to that site when talking about you.
I generally explain my interest in doing good and considering ethics (despite being anti realist) something like your point 5, and I don’t agree with or fully get your refutation that it’s not a good explanation, so I’ll engage with it and hope for clarifications.
My position, despite anti-realism and moral relativism, is that I do happen to have values (which I call "personal values"; they're mine and I don't think there's an absolute reason for anyone else to have them, though I will advocate for them to some extent) and epistemics (despite the problem of the criterion) that have been initialized in a space where I want to do Good, I want to know what is Good, and I want to iterate at improving my understanding and actions doing Good.
A quick question—when you say "Personally, though, I collect stamps", do you mean your personal approach to ethics is descriptive and exploratory (you're collecting stamps in the sense of the physics-vs-stamp-collecting image), and that you don't identify as a systematizer?
I wouldn’t identify as “systematizer for its sake” either, it’s not a terminal value, but it’s an instrumental value for achieving my goal of doing Good. I happen to have priors and heuristics saying I can do more Good by systematizing better so I do, and I get positive feedback from it so I continue this.
Re "conspicuous absence of subject-matter"—true for an anti-realist considering "absolute ethics", but this doesn't stop an anti-realist considering what they'll call "my ethics". There can be as much subject-matter there as in realist absolute ethics, because you can simulate absolute ethics within "my ethics": "I epistemically believe there is no true absolute ethics, but my personal ethics is that I should adopt what I imagine the absolute real ethics would be if it existed". I assume this is an existing theorized position, but not knowing if it already has another standard name, I call this being a "quasi-realist", which is how I'd describe myself currently.
I don't buy anti-realists treating consistency as absolute, so there's nothing to explain. I view valuing consistency as instrumental, and it happens to win all the time (every ethics has it) because of the math that you can't rule out anything otherwise. I think the person who answers "eh, I don't care that much about being ethically consistent" is correct that it's not in their terminal values, but miscalculates (they actually should value it instrumentally); it's a good mistake to point out.
I agree that someone who tries to justify their intransitivities by saying “oh I’m inconsistent” is throwing out the baby with the bathwater when they could simply say “I’m deciding to be intransitive here because it better fits my points”. Again, it’s a good mistake to point out.
I see anti realists as just picking up consistency because it’s a good property to have for useful ethics, not because “Ethics” forced it onto them (it couldn’t, it doesn’t exist).
On the final paragraph, I would write my position as: "I do ethics, as an anti-realist, because I have a brute, personal preference for Doing Good (a cluster of helping other people, reducing suffering, anything that stems from the Veil of Uncertainty which is intuitively appealing), and this is self-reinforcing (I consider it Good to want to do Good and to improve at doing Good), so I want to improve my ethics. There exist zones of value space where I'm in the dark and have no intuition (eg. population ethics/repugnant conclusion), so I use good properties (consistency, ...) to craft a curve which extends my ethics, not because of personal preference for blah-structural-properties, but by belief that this will satisfy my preferences for Doing Good the best".
If a dilemma comes up pitting object-level stakes against some abstract structural constraint, I weigh my belief that my intuition on "is this Good" is correct against my belief that "the model of ethics I constructed from other points is correct", and I'll probably update one or both. Because of the problem of the criterion, I'm not going to trust either my ethics or my data points as absolute. I have uncertainty on the position of all my points and on the best shape of the curve, so sometimes I move my estimate of the point position because it fits the curve better, and sometimes I move the curve shape because I'm pretty sure the point should be there.
I hope that’s a fully satisfying answer to “Why do ethical anti-realists do ethics”.
I wouldn’t say there’s an absolute reason why ethical anti-realists should do ethics.
Cool that you wanna get involved! I think the most important thing to do is to coordinate with other people already working on AI safety, because they might have plans and projects already going on that you can help with, and it helps avoid the unilateralist's curse.
So, a bunch of places to look into, both to understand the field of AI safety better and to find people to collaborate with:
http://aisafety.world/tiles/ (lists different people and institutions working on AI safety)
https://coda.io/@alignmentdev/alignmentecosystemdevelopment (lists AI safety communities, you might join some international ones or local ones near you)
I have an agenda around outreach (convincing relevant people to take AI safety seriously) and think it can be done productively, though it wouldn’t look like ‘screaming on the rooftops’, but more expert discussion with relevant evidence.
I’m happy to give an introduction to the field and give initial advice on promising directions, anyone interested dm me and we can schedule that.
Other commenters have argued about the correctness of using Shoggoth. I think it's mostly a correct term if you take it in the Lovecraftian sense, and currently we don't understand LLMs that much. Interpretability might work and we might progress, so we're not sure they actually are incomprehensible like Shoggoths (though according to Wikipedia, they're made of physics, so probably advanced civilizations could get to a point where they could understand them; the analogy holds surprisingly well!)
Anyhow it’s a good meme and useful to say “hey, we don’t understand these things as well as you might imagine from interacting with the smiley face” to describe our current state of knowledge.
Now for trying to construct some idea of what it is.
I'll argue a bit against calling an LLM a pile of masks, as that seems to carry implications which I don't believe in. The question we're asking ourselves is something like "what kind of algorithms/patterns do we expect to appear when an LLM is trained? Do those look like a pile of masks, or like some more general simulator that creates masks on the fly?" and the answer depends on specifics and optimization pressure. I wanna sketch out different stages we could hope to see and understand better (and I'd like for us to test this empirically and find out how true this is). Earlier stages don't disappear, as they're still useful at all times, though other things start weighing more in the next-token predictions.
Level 0: Incompetent, random weights, no information about the real world or text space.
Level 1 "Statistics 101": Dumb heuristics that don't take word positions into account.
It knows facts about the world like token distribution and uses that.
Level 2 "Statistics 201": Better heuristics, some equivalent to Markov chains.
Its knowledge of text space increases; it produces idioms and reproduces common patterns. At this stage it already contains a huge amount of information about the world. It "knows" stuff like mirrors being more likely to break and cause 7 years of bad luck.
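As a rough illustration of what "some equivalent to Markov chains" means (my own sketch, not something from the original comment), a bigram model only tracks which word tends to follow which, with no longer-range structure:

```python
# Minimal bigram "Markov chain" predictor: pattern completion with no understanding.
from collections import Counter, defaultdict

corpus = "the mirror broke and the mirror caused seven years of bad luck".split()

# Count how often each word follows each other word.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation most frequently seen after `word` in the corpus."""
    return transitions[word].most_common(1)[0][0]

print(predict_next("the"))  # 'mirror'
```

An actual level-2 LLM is of course vastly richer than this, but the flavor is the same: frequency-based continuation without any model of what generated the text.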
Level 3 "+ Simple algorithms": Some pretty specific algorithms appear (like Indirect Object Identification), which can search for certain information and transfer it in more sophisticated ways. Some of these algorithms are good enough that they might not be properly described as heuristics anymore, but instead really represent the actual rules as strongly as they exist in language (like rules of grammar properly applied). Note these circuits appear multiple times and trade off against other things, so overall behavior is still stochastic; there are heuristics on how much to weight these algorithms and other info.
Level 4 "Simulating what created that text": This is where it starts to have more and more powerful in-context learning, i.e. its weights represent algorithms which do in-context search (combined with its vast encyclopedic knowledge of texts, tropes, genres) and figure out consistencies in characters or concepts introduced in the prompt. For example it'll pick up on Alice and Bob's different backgrounds, latent knowledge about them, their accents.
But it only does that because that’s what authors generally do, and it has the same reasoning errors common to tropes. That’s because it simulates not the content of the text (the characters in the story), but the thing which generates the story (the writer, who themselves have some simulation of the characters).
So uh, do masks or a pile of masks fit anywhere in this story? Not that I see. The mask is a specific metaphor for the RLHF fine-tuning which causes mode collapse and makes the LLM mostly only play the nice assistant (and its opposites). It's a constraint or bridle or something, but if the training is light (doesn't affect the weights too much), then we expect the LLM to mostly stay the same, and what it was before was not masks.
Nor is it a pile of masks. It's a bunch of weights really good at token prediction, learning more and more sophisticated strategies for this. It encodes stereotypes in different places (maybe french=seduction or SF=techbro), but I don't believe these map onto different characters. Instead, I expect that at level 4 there's a more general algorithm which pieces together the different knowledge, so that it learns in context to simulate certain agents. Thus, if you just take mask to mean "character", the LLM isn't a pile of them, but a machine which can produce them on demand.
(In this view of LLMs, x-risk happens because we feed some input where the LLM simulates an agentic deceptive self aware agent which steers the outputs until it escapes the box)
Fantastic post, particularly interesting to me since last December, when I decided to pivot and work on AI safety (it seems the most useful thing to do). In the spirit of sharing different perspectives, I will also share what I tried and failed at doing since then, so someone else might avoid my failures and recognize more easily if your suggested plan (hobbies while in a stable job) might not work for them (as it didn't for me).
The TL;DR: My job drains too much energy from me to be productive on AI safety in my spare time (I tried and failed), nor can I easily get another job with both money and work-life balance. Thus, I prefer quitting my job and doing AI safety full time with no funding, until either I can get paid doing that, or I have to return to a job (but at least I will have produced something instead of nothing).
More context: I'm a software engineer doing backend development (code interacting with databases) at a startup. My strongest passions are game theory, psychology, education, ethics. If there were no urgency to AGI as x-risk, I would work on improving education and ethics (same vibe as the sequences wanting to train rationalists). I recently graduated from engineering school and have been working for close to 2 years with a decent salary. I discovered late in my studies that I don't like working as a programmer (because https://www.lesswrong.com/posts/x6Kv7nxKHfLGtPJej/the-topic-is-not-the-content). I have a personality where it seems I can only do good work on things that interest me (this is normal to an extent for everyone; I'm pointing out that it's stronger than normal for me, to the point that I often cannot do mental work on something which doesn't interest me (even though I can force myself to do physical work)).
So my diploma and experience are for a job I don't enjoy and which drains me. The usual circumstances of working for a company (working around certain time slots, optimizing your work for company profit over utility) are also a burden. It was in this context that I decided last December to try pivoting into AI alignment research, and I was also fortunate to have regular discussions with one AI alignment researcher about my topic, what I wanted to learn and write (he deserves credit for that effort, and I'm only not mentioning him because my failure is my own). I found a couple of papers I enjoyed, chose one to analyze in detail, and chose the goal of publishing a short LW article doing some further work on that paper. I totally failed at doing that. The full list of reasons is of course diverse and complicated, but the main one is that I put little time into it, and the time I had was low energy. I could follow that rate for a year and produce almost no value. Another reason for failure: I mistakenly chose a paper which wasn't centrally about AI safety, so the work I produced was pretty useless for solving the actual problems of AI safety.
So I abandoned that plan. Instead, I negotiated a raise in salary, kept my living expenses low and saved money. I’m now at the point where I can quit my job (and am planning to do so soon) and actually try AI safety research in suitable conditions, with the only caveat being limited time. Doing this, I hope to know much better how good a personal fit I have to alignment research and to other related work (governance, pedagogy) and choose the rest of my path accordingly.
In general, I would advise trying the hobby path first, while staying aware it won't work for everyone (a failure at being productive in something as a hobby doesn't mean you won't be good at it as a job). If it doesn't work, you have several other options as presented in the post, and I contribute the one I am currently doing: gain enough money to get some real free time (probably at least 3 months; 6 should be enough; probably no point going beyond 12) and try doing that thing you wanted to do.