What I’m curious about is how you balance this with the art of examining your assumptions.
Puzzle games are a good way of examining how my own mind works, and I often find that I go through an algorithm like:
Do I see the obvious answer?
What are a few straightforward things I could try?
Then Step 3 I see as similar to your maze-solving method:
What are the required steps to solve this? What elements constrain the search space?
But I often find that for difficult puzzles, a fourth step is required:
What assumptions am I making, that would lead me to overlook the correct answer if the assumption was false?
For instance, I may think a lever can only be pulled, and not pushed—or I may be operating under a much harder to understand assumption, like “In this maze, the only thing that matters are visual elements” when it turns out the solution to this puzzle actually involved auditory cues.
Concrete feedback signals I’ve received:
I don’t find myself excited about the work. I’ve never been properly nerd-sniped by a mechanistic interpretability problem, and I find the day-to-day work to be more drudgery than exciting, even though the overall goal of the field seems like a good one.
When left to do largely independent work, after doing the obvious first thing or two (“obvious” at the level of “These techniques are in Neel’s demos”) I find it hard to figure out what to do next, and hard to motivate myself to do more things if I do think of them because of the above drudgery.
I find myself having difficulty backchaining from the larger goal to the smaller one. I think this is a combination of a motivational issue and having less grasp on the concepts.
By contrast, in evaluations, none of this is true. I am able to solve problems more effectively, I find myself actively interested in problems, (the ones I’m working on and ones I’m not) and I find myself more able to solve problems and reason about how they matter for the bigger picture.
I’m not sure how much of each is a contributor, but I suspect that if I was sufficiently excited about the day-to-day work, all the other problems would be much more fixable. There’s a sense of reluctance, a sense of burden, that saps a lot of energy when it comes to doing this kind of work.
As for #2, I guess I should clarify what I mean, since there’s two ways you could view “not suited”.
I will never be able to become good enough at this for my funding to be net-positive. There are fundamental limitations to my ability to succeed in this field.
I should not be in this field. The amount of resources required to make me competitive in this field is significantly larger than other people who would do equally good work, and this is not true for other subfields in alignment.
I view my use of “I’m not suited” more like 2 than 1. I think there’s a reasonable chance that, given enough time with proper effort and mentorship in a proper organisational setting (being in a setting like this is important for me to reliably complete work that doesn’t excite me), I could eventually do okay at this field. But I also think that there are other people who would do better, faster, and be a better use of an organisation’s money than me.
This doesn’t feel like the case in evals. I feel like I can meaningfully contribute immediately, and I’m sufficiently motivated and knowledgable that I can understand the difference between my job and my mission (making AI go well) and feel confident that I can take actions to succeed in both of them.
If Omega came down from the sky and said “Mechanistic interpretability is the only way you will have any impact on AI alignment—it’s this or nothing” I might try anyway. But I’m not in that position, and I’m actually very glad I’m not.
Anecdotally I have also noticed this—when I tell people what I do, the thing they are frequently surprised by is that we don’t know how these things work.
As you implied, if you don’t understand how NN’s work, your natural closest analogue to ChatGPT is conventional software, which is at least understood by its programmers. This isn’t even people being dumb about it, it’s just a lack of knowledge about a specific piece of technology, and a lack of knowledge that there is something to know—that NN’s are in fact qualitatively different from other programs.
Yes, this is an argument people have made. Longtermists tend to reject it. First off, applying a discount rate on the moral value of lives in order to account for the uncertainty of the future is...not a good idea. These two things are totally different, and shouldn’t be conflated like that imo. If you want to apply a discount rate to account for the uncertainty of the future, just do that directly. So, for the rest of the post I’ll assume a discount rate on moral value actually applies to moral value.So, that leaves us with the moral argument.
A fairly good argument, and the one I subscribe to, is this:
Let’s say we apply a conservative discount rate, say, 1% per year, to the moral value of future lives.
Given that, one life now is worth approximately 500 million lives two millenia from now. (0.99^2000 = approximately 2e-9)
But would that have been reasonably true in the past? Would it have been morally correct to save a life in 0 BC at the cost of 500 million lives today?
If the answer is “no” to that, it should also be considered “no” in the present.
This is, again, different from a discount rate on future lives based on uncertainty. It’s entirely reasonable to say “If there’s only a 50% chance this person ever exists, I should treat it as 50% as valuable.” I think that this is a position that wouldn’t be controversial among longtermists.
For the Astra Fellowship, what considerations do you think people should be thinking about when deciding to apply for SERI MATS, Astra Fellowship, or both? Why would someone prefer one over the other, given they’re both happening at similar times?
The agent’s context includes the reward-to-go, state (i.e, an observation of the agent’s view of the world) and action taken for nine timesteps. So, R1, S1, A1, …. R9, S9, A9. (Figure 2 explains this a bit more) If the agent hasn’t made nine steps yet, some of the S’s are blank. So S5 is the state at the fifth timestep. Why is this important?
If the agent has made four steps so far, S5 is the initial state, which lets it see the instruction. Four is the number of steps it takes to reach the corridor where the agent has to make the decision to go left or right. This is the key decision for the agent to make, and the agent only sees the instruction at S5, so S5 is important for this reason.
Figure 1 visually shows this process—the static images in this figure show possible S5′s, whereas S9 is animation_frame=4 in the GIF—it’s fast, so it’s hard to see, but it’s the step before the agent turns.
I think there’s an aesthetic clash here somewhere. I have an intuition or like… an aesthetic impulse, telling me basically… “advocacy is dumb”. Whenever I see anybody Doing An Activism, they’re usually… saying a bunch of… obviously false things? They’re holding a sign with a slogan that’s too simple to possibly be the truth, and yelling this obviously oversimplified thing as loudly as they possibly can? It feels like the archetype of overconfidence.
This is exactly the same thing that I have felt in the past. Extremely well said. It is worth pointing out explicitly that this is not a rational thought—it’s an Ugh Field around advocacy, and even if the thought is true, that doesn’t mean all advocacy has to be this way.
I find this interesting but confusing. Do you have an idea for what mechanism allowed this? E.g: Are you getting more done per hour now than your best hours working full-time? Did the full-time hours fall off fast at a certain point? Was there only 15 hours a week of useful work for you to do and the rest was mostly padding?
I think this makes a lot of sense. While I think you can make the case for “fertility crisis purely as a means of preventing economic slowdown and increasing innovation” I think your arguments are good that people don’t actually often make this argument, and a lot of it does stem from “more people = good”.
But I think if you start from “more people = good”, you don’t actually have motivated reasoning as much as you suspect re: innovation argument. I think it’s more that the innovation argument actually does just work if you accept that more people = good. Because if more people = good, that means more people were good before penicillin and then are even more good afterwards, and these two don’t actually cancel each other out.
In summary, I don’t think that “more people = good” motivates the “Life is generally good to have, actually” argument—I think if anything it’s the other way around. People who think life is good tend to be more likely to think it’s a moral good to give it to others. The argument doesn’t say it’s “axiomatically good” to add more people, it’s “axiomatically good conditional on life being net positive”.
As for understanding why people might feel that way—my best argument is this.
Let’s say you could choose to give birth to a child who would be born with a terribly painful and crippling disease. Would it be a bad thing to do that? Many people would say yes.
Now, let’s say you could choose to give birth to a child who would live a happy and healthy positive life? Would that be a good thing? It seems that, logically, if giving birth to a child who suffers is bad, giving birth to a child who enjoys life is good.
That, imo, is the best argument for being in favor of more people if you think life is positive.
Note that I don’t think this means people should be forced to have kids or that you’re a monster for choosing not to, even if those arguments were true. You can save a life for 5k USD after all, and raising a kid yourself takes far more resources than that. Realistically, if my vasectomy makes me a bad person then I’m also a bad person for not donating every spare dollar to the AMF instead of merely 10%, and if that’s a “bad person” then the word has no meaning.
Okay, I think I see several of the cruxes here.
Here’s my understanding of your viewpoint:
“It’s utterly bizarre to worry about fertility. Lack of fertility is not going to be an x-risk anytime soon. We already have too many people and if anything a voluntary population reduction is a good thing in the relative near-term. (i.e, a few decades or so) We’ve had explosive growth over the last century in terms of population, it’s already unstable, why do we want to keep going?”
In a synchronous discussion I would now pause to see if I had your view right. Because that would take too much time in an asynchronous discussion, I’ll reply to the imaginary view I have in my head, while hoping it’s not too inaccurate. Would welcome corrections.
If this view of yours seems roughly right, here’s what I think are the viewpoint differences:
I think people who worry about fertility would agree with you that fertility is not an existential threat.
I think the intrinsic value of having more people is not an important crux—it is possible to have your view on Point 3 and still worry about fertility.
I think the “fertility crisis” is more about replacement than continued increase. It is possible that many of the people who worry about fertility would also welcome still more people, but I don’t think they would consider it a crisis if we were only at replacement rates, or close to it.
I think people who care about speed of innovation don’t just care about imposed population deadlines looming, but also about quality of life—if we had invented penicillin a century earlier, many people would have lived much longer, happier lives, for example. One could frame technological progress as a moral imperative this way. I’m not sure if this is a major crux, but I think there are people with a general “More people = good” viewpoint for this reason, even ignoring population ethics. You are right that we could use the people we have better, but I don’t see this as a mutually exclusive situation.
I think the people who worry about the fertility crisis would disagree with you about Point 4. I don’t think it’s obvious that “tech to deal with an older population” is actually easier than “tech to deal with a larger population”. It might be! Might not be.
While you may not agree with these ideas, I hope I’ve presented them reasonably and accurately enough that it makes the other side merely different, rather than bizarre and impossible to understand.
I would suggest responding with your points (Top 3-5, if you have too many to easily list) on why this is incredibly obviously not a problem, seeing where you get pushback if anywhere, and iterating from there. Don’t be afraid to point out “incredibly obvious” things—it might not be incredibly obvious to other people. And if you’re genuinely unsure why anyone could think this is a problem, the responses to your incredibly obvious points should give you a better idea.
I think Tristan is totally right, and it puts an intuition I’ve had into words. I’m not vegan—I am sympathetic to the idea of having this deep emotional dislike of eating animals, I feel like the version of me who has this is a better person, and I don’t have it. From a utilitarian perspective I could easily justify just donating a few bucks to animal charities...but veganism isn’t about being optimally utilitarian. I see it as more of a virtue ethics thing. It’s not even so much that I want to be vegan, but I want to be the kind of person who chooses it. But I’m not sufficiently good of a person to actually do it, which does make me feel somewhat guilty at times. As a salve to my conscience, I’ve recently decided to try giving up chicken entirely, which seems like a solid step forward that is still pretty easy to make.
One of the core problems of AI alignment is that we don’t know how to reliably get goals into the AI—there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.
Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.
These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won’t we get some weird thing that happens to correlate with “don’t die, keep your options open” on the training data, but falls apart out of distribution?
I found an error in the application—when removing the last item from the blacklist, every page not whitelisted is claimed to be blacklisted. Adding an item back to the blacklist fixes this. Other than that, it looks good!
Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn’t finetuned, if there’s one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a “helpful, harmless, honest” directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.
Does the model then:
Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way—this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI’s)
One hard part is that it’s difficult to disentangle “Competently lies about the location of Canada” and “Actually believes, insomuch as a language model believes anything, that Canada is in Asia now”, but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.
In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.
According to my understanding of RLHF, the goal-approximation it trains for is “Write a prompt that is likely to be rated as positive”. In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model’s best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g, if the model knows something the raters don’t, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?
I don’t really understand how your central point applies here. The idea of “money saves lives” is not supposed to be a general rule of society, but rather a local point about Alice and Bob—namely, donating ~5k will save a life. That doesn’t need to be always true under all circumstances, there just needs to be some repeatable action that Alice and Bob can take (e.g, donating to the AMF) that costs 5k for them that reliably results in a life being saved. (Your point about prolonging life is true, but since the people dying of malaria are generally under 5, the amount of QALY’s produced is pretty close to an entire human lifetime)
It doesn’t really matter, for the rest of the argument, how this causal relationship works. It could be that donating 5k causes more bednets to be distributed, it could be that donating 5k allows for effective lobbying to improve economic growth to the value of one life, or it could be that the money is burnt in a sacrificial pyre to the God of Charitable Sacrifices, who then descends from the heavens and miraculously cures a child dying of malaria. From the point of view of Alice and Bob, the mechanism isn’t important if you’re talking on the level of individual donations.
In other words, Alice and Bob are talking on the margins here, and on the margin, 5k spent equals one live saved, at least for now.
Not quite, in my opinion. In practice, humans tend to be wrong in predictable ways (what we call a “bias”) and so picking the best option isn’t easy.
What we call “rationality” tends to be the techniques / thought patterns that make us more likely to pick the best option when comparing alternatives.