For in-person conversations (I know this was meant as a norm for public discourse): Personally I tend to have a hard time digging into my memories for “data points” when I have a negative or positive impression of some person. It’s kind of the same thing with people asking you “What have you been working on the past week?” – I basically never remember anything immediately (even though I do work on stuff). This creates asymmetric incentives where it’s easier to make negative judgments seem unjustified or at least costly to bring up, which can contribute to a culture where justified critical opinions almost never reach enough of a consensus to change something. I definitely think there should be norms similar to the one described in the post, but I also think that there are situations (e.g., if a person has a reliable track record or if they promise to write a paragraph with some bullet points later on once they have had time to introspect) where the norm should be less strict than “back the judgment up immediately or retract it.” And okay, probably one can manage to say a few words even on the spot because introspection is not that slow and opaque, but my point is simply that “This sounds unconvincing” is just as cheap a thing to say as cheap criticism, and the balance should be somewhere in between. So maybe instead of “justify” the norm should say something like “gesture at the type of reasons,” and that should be the bare minimum and more transparency is often preferable. (Another point is that introspecting on intuitive judgments helps refine them, so that’s something that people should do occasionally even if they aren’t being put on the spot to back something up.)
Needless to say, lax norms around this can be terrible in social environments where some people tend to talk too negatively about others and where the charitable voices are less frequent, so I think it’s one of those things where the same type of advice can sometimes be really good, and other times can be absolutely terrible.
I’m reluctant to reply because it sounds like you’re looking for rebuttals by explicit proponents of hard takeoff who have thought a great deal about takeoff speeds, and neither description applies to me. But I could sketch some intuitions why reading the pieces by AI Impacts and by Christiano hasn’t felt wholly convincing to me. (I’ve never run these intuitions past anyone and don’t know if they’re similar to cruxes held by proponents of hard takeoff who are more confident in hard takeoff than I am – therefore I hope people don’t update much further against hard takeoff in case they find the sketch below unconvincing.) I found that it’s easiest for me to explain something if I can gesture towards some loosely related “themes” rather than go through a structured argument, so here are some of these themes and maybe people see underlying connections between them:
Shulman and Sandberg have argued that one way to get hard takeoff is via hardware overhang: when a new algorithmic insight can be used immediately to its full potential, because much more hardware is available than one would have needed to surpass state-of-the-art performance metrics with the new algorithms. I think there’s a similar dynamic at work with culture: If you placed an AGI into the stone age, it would be inefficient at taking over the world even with appropriately crafted output channels because stone age tools (which include stone age humans the AGI could manipulate) are neither very useful nor reliable. It would be easier for an AGI to achieve influence in 1995 when the environment contained a greater variety of increasingly far-reaching tools. But with the internet being new, particular strategies to attain power (or even just rapidly acquire knowledge) were not yet available. Today, it is arguably easier than ever for an AGI to quickly and more-or-less single-handedly transform the world.
There’s a sense in which cavemen are similarly intelligent as modern-day humans. If we time-traveled back into the stone age, found the couples with the best predictors for having gifted children, gave these couples access to 21st century nutrition and childbearing assistance, and then took their newborns back into today’s world where they’d grow up in a loving foster family with access to high-quality personalized education, there’s a good chance some of those babies would grow up to be relatively ordinary people of close to average intelligence. Those former(?) cavemen and cavewomen would presumably be capable of dealing with many if not most aspects of contemporary life and modern technology.
However, there’s also a sense in which cavemen are very unintelligent compared to modern-day humans. Culture, education, possibly even things like the Flynn effect, etc. – these really do change the way people think and act in the world. Cavemen are incredibly uneducated and untrained concerning knowledge and skills that are useful in modern, tool-rich environments.
We can think of this difference as the difference between the snapshot of someone’s intelligence at the peak of their development and their (initial) learning potential. Cavemen and modern-day humans might be relatively close to each other in terms of the latter, but when considering their abilities at the peak of their personal development, the modern humans are much better at achieving goals in tool-rich environments. I sometimes get the impression that proponents of soft takeoffs underappreciate this difference when addressing comparisons between, for instance, early humans and chimpanzees (this is just a vague general impression which doesn’t apply to the arguments presented by AI Impacts or by Paul Christiano).
Both for productive engineers and creative geniuses, it holds that they could only have developed their full potential because they picked up useful pieces of insight from other people. But some people cannot tell the difference between high-quality information and low-quality information, or might make wrong use even of high-quality information, reasoning themselves into biased conclusions. An AI system capable of absorbing the entire internet but terrible at telling good ideas from bad ideas won’t make too much of a splash (at least not in terms of being able to take over the world). But what about an AI system just slightly above some cleverness threshold for adopting an increasingly efficient information diet? Couldn’t it absorb the internet in a highly systematic way rather than just soaking in everything indiscriminately, learning many essential meta-skills on its way, improving how it goes about the task of further learning?
If the child in the chair next to me in fifth grade was slightly more intellectually curious, somewhat more productive, and marginally better dispositioned to adopt a truth-seeking approach and self-image than I am, this could initially mean they score 100%, and I score 95% on fifth-grade tests – no big difference. But as time goes on, their productivity gets them to read more books, their intellectual curiosity and good judgment get them to read more unusually useful books, and their cleverness gets them to integrate all this knowledge in better and increasingly more creative ways. I’ll reach a point where I’m just sort of skimming things because I’m not motivated enough to understand complicated ideas deeply, whereas they find it rewarding to comprehend everything that gives them a better sense of where to go next on their intellectual journey. By the time we graduate university, my intellectual skills are mostly useless, while they have technical expertise in several topics, can match or even exceed my thinking even on areas I specialized in, and get hired by some leading AI company. The point being: an initially small difference in dispositions becomes almost incomprehensibly vast over time.
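To make the compounding intuition a bit more concrete, here is a toy sketch. Everything in it is an assumption made up for illustration: the function name `skill_trajectory`, the idea that “skill” grows by a fixed percentage per year, and the specific growth rates (10% vs. 25%). It is not a model of real learning; it only shows how a 95-vs-100 starting gap can turn into a several-fold gap when the yearly rates differ.

```python
def skill_trajectory(initial_skill, yearly_growth, years):
    """Return the skill level at the start and after each year of compound growth.

    All quantities are hypothetical units chosen for illustration.
    """
    levels = [initial_skill]
    for _ in range(years):
        levels.append(levels[-1] * (1 + yearly_growth))
    return levels


# Assumed numbers: fifth grade to the end of university is roughly 12 years.
me = skill_trajectory(initial_skill=95.0, yearly_growth=0.10, years=12)
peer = skill_trajectory(initial_skill=100.0, yearly_growth=0.25, years=12)

ratio_start = peer[0] / me[0]    # gap on the fifth-grade test
ratio_end = peer[-1] / me[-1]    # gap at graduation

print(f"start: peer is {ratio_start:.2f}x my level")
print(f"end:   peer is {ratio_end:.2f}x my level")
```

The qualitative result does not depend much on the exact rates: any persistent difference in the growth rate eventually dominates the initial difference in level.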
(I realized that in this title/paragraph, the word “knowing” is meant both in the sense of “knowing how to do x” and “being capable of executing x very well.” It might be useful to try to disentangle this some more.) The standard AI foom narrative sounds a bit unrealistic when discussed in terms of some AI system inspecting itself and remodeling its inner architecture in a very deliberate way driven by architectural self-understanding. But what about the framing of being good at learning how to learn? There’s at least a plausible-sounding story we can tell where such an ability might qualify as the “secret sauce” that gives rise to a discontinuity in the returns of increased AI capabilities. In humans – and admittedly this might be too anthropomorphic – I’d think about it in this way: If my 12-year-old self had been brain-uploaded to a suitable virtual reality, made copies of, and given the task of devouring the entire internet in 1,000 years of subjective time (with no aging) to acquire enough knowledge and skill to produce novel and for-the-world useful intellectual contributions, the result probably wouldn’t be much of a success. If we imagined the same with my 19-year-old self, there’s a high chance the result wouldn’t be useful either – but also some chance it would be extremely useful. Assuming, for the sake of the comparison, that a copy clan of 19-year olds can produce highly beneficial research outputs this way, and a copy clan of 12-year olds can’t, what does the landscape look like in between? I don’t find it evident that the in-between is gradual. I think it’s at least plausible that there’s a jump once the copies reach a level of intellectual maturity to make plans which are flexible enough at the meta-level and divide labor sensibly enough to stay open to reassessing their approach as time goes on and they learn new things. 
Maybe all of that is gradual, and there are degrees of dividing labor sensibly or of staying open to reassessing one’s approach – but that doesn’t seem evident to me. Maybe this works more as an on/off thing.
It makes sense to be somewhat suspicious about any hypotheses according to which the evolution of general intelligence made a radical jump in Homo sapiens, creating thinking that is “discontinuous” from what came before. If knowing how to learn is an on/off ability that plays a vital role in the ways I described above, how could it evolve? We’re certainly also talking culture, not just genes. And via the Baldwin effect, natural selection can move individuals closer towards picking up surprisingly complex strategies via learning from their environment. At this point at the latest, my thinking becomes highly speculative. But here’s one hypothesis: In its generalization, this effect is about learning how to learn. And maybe there is something like a “broad basin of attraction” (inspired by Christiano’s broad basin of attraction for corrigibility) for robustly good reasoning / knowing how to learn. Picking up some of the right ideas initially and early on, combined with being good at picking up things in general, produces in people an increasingly better sense of how to order and structure other ideas, and over time, the best human learners start to increasingly resemble each other, having homed in on the best general strategies.
For most people, the returns of self-improvement literature (by which I mean not just productivity advice, but also information on “how to be more rational,” etc.) might be somewhat useful, but rarely life-changing. People don’t tend to “go foom” from reading self-improvement advice. Why is that, and how does it square with my hypothesis above, that “knowing how to learn” could be a highly valuable skill with potentially huge compounding benefits? Maybe the answer is that the bottleneck is rarely knowledge about self-improvement, but rather the ability to make the best use of such knowledge? This would support the hypothesis mentioned above: If the critical skill is finding useful information in a massive sea of both useful and not-so-useful information, that doesn’t necessarily mean that people will get better at that skill if we gave them curated access to highly useful information (even if it’s information about how to find useful information, i.e., good self-improvement advice). Maybe humans don’t tend to go foom after receiving humanity’s best self-improvement advice because too much of that is too obvious for people who were already unusually gifted and then grew up in modern society where they could observe and learn from other people and their habits. However, now imagine someone who had never read any self-improvement advice, and could never observe others. For that person, we might have more reason to expect them to go foom – at least compared to their previous baseline – after reading curated advice on self-improvement (or, if it is true that self-improvement literature is often somewhat redundant, even just from joining an environment where they can observe and learn from other people and from society). And maybe that’s the situation in which the first AI system above a certain critical capabilities threshold finds itself. 
The threshold I mean is (something like) the ability to figure out how to learn quickly enough to then approach the information on the internet like the hypothetical 19-year olds (as opposed to the 12-year olds) from the thought experiment above.
(This argument is separate from all the other arguments above.) Here’s something I never really understood about the framing of the hard vs. soft takeoff discussion. Let’s imagine a graph with inputs such as algorithmic insights and compute/hardware on the x-axis, and general intelligence (it doesn’t matter for my purposes whether we use learning potential or snapshot intelligence) on the y-axis. Typically, the framing is that proponents of hard takeoff believe that this graph contains a discontinuity where the growth mode changes, and suddenly the returns (for inputs such as compute) are vastly higher than the outside view would have predicted, meaning that the graph makes a jump upwards in the y-axis. But what about hard takeoff without such a discontinuity? If our graph starts to be steep enough at the point where AI systems reach human-level research capabilities and beyond, then that could in itself allow for some hard (or “quasi-hard”) takeoff. After all, we are not going to be sampling points (in the sense of deploying cutting-edge AI systems) from that curve every day – that simply wouldn’t work logistically even granted all the pressures to be cutting-edge competitive. If we assume that we only sample points from the curve every two months, for instance, is it possible that for whatever increase in compute and algorithmic insights we’d get in those two months, the differential on the y-axis (some measure of general intelligence) could be vast enough to allow for attaining a decisive strategic advantage (DSA) from being first? I don’t have strong intuitions about what the offense-defense balance will shift to once we are close to AGI, but it at least seems plausible that it turns more towards offense, in which case arguably a lower differential is needed for attaining a DSA. 
In addition, based on the classical arguments put forward by researchers such as Bostrom and Yudkowsky, it also seems at least plausible to me that we are potentially dealing with a curve that is very steep around the human level. So, if one AGI project is two months ahead of another project, and we for the sake of argument assume that there are no inherent discontinuities in the graph in question, it’s still not evident to me that this couldn’t lead to something that very much looks like hard takeoff, just without an underlying discontinuity in the graph.
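Here is a toy sketch of the “steep but continuous” case. The exponential shape of the curve, the function names `capability` and `lead_ratio`, and all the numbers are assumptions I picked purely for illustration; nothing here is an estimate of real AI progress. The only point is that a perfectly smooth curve can still give a project that is two months ahead a large capability multiple if the curve is steep enough in that region.

```python
import math


def capability(inputs, steepness):
    """Smooth capability curve with no discontinuity (assumed exponential)."""
    return math.exp(steepness * inputs)


def lead_ratio(months_ahead, inputs_per_month, steepness, follower_inputs):
    """How many times more capable the leader is than a follower who trails
    by `months_ahead` months of input growth (compute plus algorithmic insight)."""
    leader_inputs = follower_inputs + months_ahead * inputs_per_month
    return capability(leader_inputs, steepness) / capability(follower_inputs, steepness)


# Same two-month lead, same input growth, two hypothetical steepness values.
shallow = lead_ratio(months_ahead=2, inputs_per_month=1.0, steepness=0.05, follower_inputs=10.0)
steep = lead_ratio(months_ahead=2, inputs_per_month=1.0, steepness=2.0, follower_inputs=10.0)

print(f"shallow region: leader is {shallow:.2f}x the follower")
print(f"steep region:   leader is {steep:.1f}x the follower")
```

Whether a given multiple is enough for a DSA is a separate question that depends on the offense-defense balance; the sketch only illustrates that no jump in the curve is needed to get a large differential from a short lead.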
Leaning on this, someone could write a post about the “infectiousness of realism” since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P
For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn’t true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.
Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we’ll know by that point whether we’ve found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified “metaethical framework” or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there (though I’d say there are many open questions and I’d feel extremely unsure about how to proceed regarding for instance “2. A method for synthesising such basic preferences into a single utility function or similar object,” and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don’t complicate things enough to introduce large new risks.
Ethical theories don’t need to be simple. I used to have the belief that ethical theories ought to be simple/elegant/non-arbitrary for us to have a shot at them being the correct theory, a theory that intelligent civilizations with different evolutionary histories would all converge on. This made me think that NU might be that correct theory. Now I’m confident that this sort of thinking was confused: I think there is no reason to expect that intelligent civilizations with different evolutionary histories would converge on the same values, or that there is one correct set of ethics that they “should” converge on if they were approaching the matter “correctly”. So, looking back, my older intuition feels confused now in a similar way as ordering the simplest food in a restaurant in expectation of anticipating what others would order if they also thought that the goal was that everyone orders the same thing. Now I just want to order the “food” that satisfies my personal criteria (and these criteria do happen to include placing value on non-arbitrariness/simplicity/elegance, but I’m a bit less single-minded about it).
Your way of unifying psychological motivations down to suffering reduction is an “externalist” account of why decisions are made, which is different from the internal story people tell themselves. Why think all people who tell different stories are mistaken about their own reasons? The point “it is a straw man argument that NUs don’t value life or positive states” is unconvincing, as others have already pointed out. I actually share your view that a lot of things people do might in some way trace back to a motivating quality in feelings of dissatisfaction, but (1) there are exceptions to that (e.g., sometimes I do things on auto-pilot and not out of an internal sense of urgency/need, and sometimes I feel agenty and do things in the world to achieve my reflected life goals rather than tend to my own momentary well-being), and (2) that doesn’t mean that whichever parts of our minds we most identify with need to accept suffering reduction as the ultimate justification of their actions. For instance, let’s say you could prove that a true proximate cause why a person refused to enter Nozick’s experience machine was that, when they contemplated the decision, they felt really bad about the prospect of learning that their own life goals are shallower and more self-centered than they would have thought, and *therefore* they refuse the offer. Your account would say: “They made this choice driven by the avoidance of bad feelings, which just shows that ultimately they should accept the offer, or choose whichever offer reduces more suffering all-things-considered.” Okay yeah, that’s one story to tell. But the person in question tells herself the story that she made this choice because she has strong aspirations about what type of person she wants to be. Why would your externally-imported justification be more valid (for this person’s life) than her own internal justification?
I think I broadly agree with all the arguments to characterize the problem and to motivate indefinability as a solution, but I have different (meta-)meta-level intuitions about how palatable indefinability would be, and as a result of that, I’d say I have been thinking about similar issues in a differently drawn framework. While you seem to advocate for “salvaging the notion of ‘one ethics’” while highlighting that we then need to live with indefinability, I am usually thinking of it in terms of: “Most of this is underdefined, and that’s unsettling at least in some (but not necessarily all) cases, and if we want to make it less underdefined, the notion of ‘one ethics’ has to give.” Maybe one reason why I find indefinability harder to tolerate is that in my own thinking, the problem arises forcefully at an earlier/higher-order stage already, and therefore the span of views that “ethics” is indefinable about(?) is larger and already includes questions of high practical significance. Having said that, I think there are some important pragmatic advantages to an “ethics includes indefinability” framework, and that might be reason enough to adopt it. While different frameworks tend to differ in the underlying intuitions they highlight or move into the background, I think there is more than one parsimonious framework in which people can “do moral philosophy” in a complete and unconfused way. Translation between frameworks can be difficult though (which is one reason I started to write a sequence about moral reasoning under anti-realism, to establish starting points for disagreements, but then I got distracted – it’s on hold now).
Some more unorganized comments (apologies for “lazy” block-quote commenting):
Moral indefinability is the term I use for the idea that there is no ethical theory which provides acceptable solutions to all moral dilemmas, and which also has the theoretical virtues (such as simplicity, precision and non-arbitrariness) that we currently desire.
This idea seems correct to me. And as you indicate later in the paragraph, we can add that it’s plausible that the “theoretical virtues” are not well-specified either (e.g., there’s disagreement between people’s theoretical desiderata, or there’s vagueness in how to cash out a desideratum such as “non-arbitrariness”).
My claim is that eventually we will also need to change our meta-level intuitions in important ways, because it will become clear that the only theories which match them violate key object-level intuitions.
This recommendation makes sense to me (insofar as one can still do that), but I don’t think it’s completely obvious. Because both meta-level intuitions and object-level intuitions are malleable in humans, and because there’s no(t obviously a) principled distinction between these two types of intuitions, it’s an open question to what degree people want to adjust their meta-level intuitions in order to not have to bite the largest bullets.
If the only reason people were initially tempted to bite the bullets in question (e.g., accept a counterintuitive stance like the repugnant conclusion) was because they had a cached thought that “Moral theories ought to be simple/elegant”, then it makes a lot of sense to adjust this one meta-level intuition after the realization that it seems ungrounded. However, maybe “Moral theories ought to be simple/elegant” is more than just a cached thought for some people:
Some moral realists buy the “wager” that their actions matter infinitely more in case moral realism is true. I suspect that an underlying reason why they find this wager compelling is that they have strong meta-level intuitions about what they want morality to be like, and it feels to them that it’s pointless to settle for something other than that.
I’m not a moral realist, but I find myself having similarly strong meta-level intuitions about wanting to do something that is “non-arbitrary” and in relevant ways “simple/elegant”. I’m confused about whether that’s literally the whole intuition, or whether I can break it down into another component. But motivationally it feels like this intuition is importantly connected to what makes it easy for me to go “all-in” for my ethical/altruistic beliefs.
A second reason to believe in moral indefinability is the fact that human concepts tend to be open texture: there is often no unique “correct” way to rigorously define them.
I strongly agree with this point. I think even very high-level concepts in moral philosophy or the philosophy of reason/self-interest are “open texture” like that. In your post you seem to start with an assumption that people have a rough, shared sense of what “ethics” is about. But if the fuzziness is already attacking at this very high level, it calls into question whether you can find a solution that seems satisfying to different people’s (fuzzy and underdetermined) sense of what the question/problem is even about.
For instance, there are narrow interpretations such as “ethics as altruism/caring/doing good” (which I think roughly captures at least large parts of what you assume, and it also captures the parts I’m personally most interested in). There’s also “ethics as cooperation or contract”. And maybe the two blend into each other.
Then there’s the broader (I label it “existentialist”) sense in which ethics is about “life goals” or “Why do I get up in the morning?”. And within this broader interpretation of it, you suddenly get narrower subdomains like “realism about rationality” or “What makes up a person’s self-interest?” where the connections to the other narrower domains (e.g. “ethics as altruism”) are not always clear.
I think indefinability is a plausible solution (or meta-philosophical framework?) for all of these. But when the scope over which we observe indefinability becomes so broad, it illustrates why it might feel a bit frustrating for some people, because without clearly delineated concepts it can be harder to make progress, and so a framework in which indefinability plays a central role could in some cases obscure conceptual progress in subareas where one might be able to make such progress (at least at the “my personal morality” level, though not necessarily at the level of a “consensus morality”).
(I’m not sure I’m disagreeing with you BTW; probably I’m just adding thoughts and blowing up the scope of your post.)
I would guess that many anti-realists are sympathetic to the arguments I’ve made above, but still believe that we can make morality precise without changing our meta-level intuitions much—for example, by grounding our ethical beliefs in what idealised versions of ourselves would agree with, after long reflection. My main objection to this view is, broadly speaking, that there is no canonical “idealised version” of a person, and different interpretations of that term could lead to a very wide range of ethical beliefs.
I agree. The second part of my comment here tries to talk about this as well.
And even if idealised reflection is a coherent concept, it simply passes the buck to your idealised self, who might then believe my arguments and decide to change their meta-level intuitions.
Yeah. I assume most of us are familiar with a deep sense of uncertainty about whether we found the right approach to ethical deliberation. And one can maybe avoid this uncomfortable feeling of uncertainty by deferring to idealized reflection. But it’s not obvious that this lastingly solves the underlying problem: Maybe we’ll always feel uncertain whenever we enter the mode of “actually making a moral judgment”. If I found myself as a virtual person who is part of a moral reflection procedure such as Paul Christiano’s indirect normativity, I wouldn’t suddenly know and feel confident in how to resolve my uncertainties. And the extra power, and the fact that life in the reflection procedure would be very different from the world I currently know, introduces further risks and difficulties. I think there are still reasons why one might want to value particularly-open-ended moral reflection, but maybe it’s important that people don’t use the uncomfortable feeling of “maybe I’m doing moral philosophy wrong” as their sole reason to value particularly-open-ended moral reflection. If the reality is that this feeling never goes away, then there seems something wrong with the underlying intuition that valuing particularly-open-ended moral reflection is by default the “safe” or “prudent” thing to do. (And I’m not saying it’s wrong for people to value particularly-open-ended moral reflection; I suspect that it depends on one’s higher-order intuitions: For every perspective there’s a place where the buck stops.)
From an anti-realist perspective, I claim that perpetual indefinability would be better.
It prevents fanaticism, which is a big plus. And it plausibly creates more agreement, which is also a plus in some weirder sense (there’s a “non-identity problem” type thing about whether we can harm future agents by setting up the memetic environment such that they’ll end up having less easily satisfiable goals, compared to an alternative where they’d find themselves in larger agreement and therefore with more easily satisfiable goals). A drawback is that it can mask underlying disagreements and maybe harm underdeveloped positions relative to the status quo.
That may be a little more difficult to swallow from a realist perspective, of course. My guess is that the core disagreement is whether moral claims are more like facts, or more like preferences or tastes
That’s a good description. I sometimes use the analogy of “morality is more like career choice than scientific inquiry”.
I don’t think that’s a coincidence: psychologically, humans just aren’t built to be maximisers, and so a true maximiser would be fundamentally adversarial.
This is another good instrumental/pragmatic argument why anti-realists interested in shaping the memetic environment where humans engage in moral philosophy might want to promote the framing of indefinability rather than “many different flavors of consequentialism, and (eventually) we should pick”.
AlphaStar’s innovative league-based training process finds the approaches that are most reliable and least likely to go wrong.
“Go wrong” is still tied to the game’s win condition. So while the league-based training process does find the set of agents whose gameplay is least exploitable (among all the agents they trained), it’s not obvious how this relates to problems in AGI safety such as goal specification or robustness to capability gains. Maybe they’re thinking of things like red teaming. But without more context I’m not sure how safety-relevant this is.
2. The ability to comment on a specific line in a document, with the comment showing up in context.
Yeah, I really like how convenient that is.
For me there’s a huge difference between these two.
In gdocs I feel like it’s more okay to write “unpolished” comments. I think that’s mostly because the expectations are lower. Polishing my comments takes me 3-5x longer, which often takes away the motivation to comment at all.
In a public forum I worry more about provoking misleading impressions. For instance, in a gdoc shared with people who know me well, I’m not worried that a comment like “AIs might do [complex sequence of actions]” will get people to think that I have weirdly confident views about how the future might play out. In public conversations I’d experience a strong urge to qualify statements like that even though it feels tedious to do so.
You need a lot of hindsight bias to say that it was clear from the get go which paradigms were going to win over the last century.
Sure. And I think Kuhn’s main point as summarized by Scott really does give a huge blow to the naive view that you can just compare successful predictions to missed predictions, etc.
But to think that you cannot do better than chance at generating successful new hypotheses is obviously wrong. There would be way too many hypotheses to consider, and not enough scientists to test them. From merely observing science’s success, we can conclude that there has to be some kind of skill (Yudkowsky’s take on this is here and here, among other places) that good scientists employ to do better than chance at picking what to work on. And IMO it’s a strange failure of curiosity to not want to get to the bottom of this when studying Kuhn or the history of science.
When I hear scientists talk about Thomas Kuhn, he sounds very reasonable. [...] When I hear philosophers talk about Thomas Kuhn, he sounds like a madman.
Yes, this! I remember I was extremely confused by the discourse around Kuhn. I’m not sure whether for me the impression was split into scientists vs. non-scientists, but I definitely felt like there was something weird about it and there were two sides to it, one that sounded potentially reasonable, and one that sounded clearly like relativism.
When taking a course on the book, I concluded that both perspectives were appropriate. One thing that went too far into relativism was Kuhn’s insistence that there is no way to tell in advance which paradigm is going to be successful. His description of this is that you pick “teams” initially for all kinds of not-truth-tracking reasons, and you only figure out many years later whether your new paradigm will be winning or not.
But I’m not sure Kuhn even was (at least in The Structure of Scientific Revolutions) explicitly saying “No, you cannot do better than chance at picking sides.” Rather, the weird thing is that I remember feeling like he was not explicitly asking that question, that he was just brushing it under the carpet. Likewise the lecturer of the course, a Kuhn expert, seemed to only be asking the question “How does (human-)science proceed?”, and never “How should science proceed?”
Suppose the agent you’re trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don’t see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
I’m not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:
Imitation learning applied to humans produces goal-directed behavior only insofar as humans are goal-directed.
Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct “errors” that the humans make.)
Regarding the second point, there’s a safety-relevant sense in which a human-imitating agent is less goal-directed than the human. Because if you scale the human’s capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator’s capabilities, it’s only supposed to become even better at imitating the unscaled human.
I believe for some people it’s very important to have a moment of realization that one can get to the frontier of knowledge in a given field of interest. It feels intimidating if others are making contributions that seem decisively out of your league. Because people might intuitively underestimate how far you can get with focused reading and learning, it could be good to give tailored advice to people newer to (e.g.) AI risk for how/where they can make contributions that will feel encouraging. For illustration, a few years ago I was playing a computer game for fun for quite a while until I was by chance matched up with one of the better competitive players and I almost won against them, getting lucky. That experience showed me that I’d have a shot if I actually tried, and it encouraged me to immediately start practicing with the aim of becoming competitive at that game. It changed my mindset overnight. Similarly, I think there’s a difference in mindset between “reading and talking about research topics for fun” and “reading and talking about research topics with the intent of seriously contributing”.
I agree with others that a rewarding social environment and people in a similar range of competence you can bounce ideas back-and-forth with are extremely important. If you collaborate with people who are similarly driven to figure things out and discuss ideas with you, that automatically forces you to think about your ideas for much longer and in more detail. By yourself you might stop thinking about a topic once you reach a roadblock, but if every morning you wake up to new messages by a collaborator adding criticism or new bits to your thinking, you’re going to keep working on the topic.
I also suspect that people are sometimes too modest (or in the wrong mindset) to develop the habit of “taking stances”. Some people know about a lot of different considerations and can tell you in detail what others have written, but they don’t invest effort coming up with their own opinion – presumably because they don’t consider themselves to be experts. Some of the community norms about not being overconfident might contribute to this failure mode, but the two things are distinct because people can try practicing taking stances with personal “pre-Aumann opinions”, which they are free to largely ignore when deferring to the experts for an all-things-considered judgment.
Speculation about personality traits conducive to generating ideas: OCD was mentioned in the comments. There’s also OCPD and hyperfocus. Carl Shulman’s advice for researchers among other things mentions something about having a strong emotional reaction to people being wrong on the internet (in communities you care about) – I think this might be a symptom of being very invested in the ideas, and it can help further clarify one’s thinking while trying to articulate fervently why something is wrong. Need for closure also seems relevant to me. It has its dangers because it can lead to one-sided thinking. But I, at least, am often driven by feeling deeply unsatisfied with not having answers to questions that seem strategically important. And, anecdotally, I know some people with low need for closure who I consider to be phenomenal researchers in most important respects, but these people are less creative than I would be with their skills and backgrounds, and their obsessive focus maybe goes into greater width of research rather than zooming in on making progress on the “construction sites”. Finally, I strongly agree with John Maxwell’s point that a “temporary delusion” for thinking that one’s ideas are really good is a great reinforcement mechanism (even though it often leads to embarrassment later on).
I interpreted Wei’s comment as saying that even your reflective life goals would be underdetermined—presumably even now if you hear convincing moral argument A but not B, then you’d have different reflective life goals than if you hear B but not A.
Okay yeah, that also seems broadly correct to me.
I am hoping though that, as long as I’m not subjected to optimization pressures from outside that weren’t crafted to be helpful, it’s very rare that something I’d currently consider very important can end up either staying important or becoming completely unimportant merely based on the order of new arguments encountered. And similarly I’m hoping that my value endpoints would still cluster decisively around the things I currently consider most important – though that’s where it becomes tricky to trade off goal preservation versus openness for philosophical progress.
Thanks! I think I understand the intent of the rephrasing now.
What I meant with “obscure” is that both “true utility function” and “utility function that encodes the optimal actions to take for the best possible universe” have normative terminology in them that I don’t know how to reduce or operationalize.
For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I’d feel nervous about still making some kind of mistake. Both your phrasings (to my ears) carry the connotation that there is a “best” mistake model, one which is in a relevant sense independent from our own judgment, where we can learn things that will make us more and more confident that now we’re probably not making mistakes anymore because of progress in finding the correct way of thinking about our values. That’s the part that feels obscure to me because I think we’ll always be in this unsatisfying epistemic situation where we’re nervous about making some kind of mistake by the light of a standard that we cannot properly describe.
I do get the intuition for thinking in these terms, though. It feels conceivable that another discovery similar to what cognitive biases did could improve our thinking, and I definitely agree that we want a concept for staying open to this possibility. I’m just pointing out that non-operationalized normative concepts seem obscure. (Though maybe that’s fine if we’re treating them in the same way Yudkowsky treats “magic reality fluid” – as a placeholder for whatever comes once we’re less confused about “measure”.)
This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work.
Out of curiosity, is there an intuitive explanation as to why these are different? Is it mainly because ambitious value learning inevitably has to deal with lots of (systematic) mistakes in the data, whereas normally you’d make sure that the training data doesn’t contain (many) obvious mistakes? Or are there examples in ML where you can retroactively correct mistakes imported from a flawed training set?
(I’m not sure “training set” is the right word for the IRL context. Applied to ambitious value learning, what I mean would be the “human policy”.)
Update: Ah, it seems like the next post is all about this! :) My point about errors seems like it might be vaguely related, but the explanation in the next post feels more satisfying. It’s a different kind of problem because you’re not actually interested in predicting observable phenomena anymore, but instead are trying to infer the “latent variable” – the underlying principle(?) behind the inputs. The next post in the sequence also gives me a better sense of why people say that ML is typically “shallow” or “surface-level reasoning”.
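A toy sketch of why inferring the “latent variable” is a different kind of problem from predicting behavior. This is an illustration with made-up states, actions, and rewards (all names below are hypothetical and not taken from the sequence): if the human policy is modeled as a planner applied to a reward function, then a fully rational planner with reward R and a fully anti-rational planner with reward −R produce exactly the same observable policy, so the behavior alone cannot tell the two apart.

```python
# Toy illustration (hypothetical setup): the same observable policy is
# explained equally well by two opposite (planner, reward) decompositions.

STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Made-up reward table: REWARD[state][action]
REWARD = {
    "s0": {"left": 1.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}


def negate(reward):
    """Return the reward table with every entry's sign flipped."""
    return {s: {a: -r for a, r in acts.items()} for s, acts in reward.items()}


def rational_planner(reward):
    """Choose the reward-maximizing action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


def anti_rational_planner(reward):
    """Choose the reward-minimizing action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


# Decomposition A: the human is rational and values REWARD.
policy_a = rational_planner(REWARD)
# Decomposition B: the human is anti-rational and values the opposite.
policy_b = anti_rational_planner(negate(REWARD))

# Both decompositions predict identical behavior in every state.
print(policy_a == policy_b)
```

Both decompositions fit the “data” perfectly, and neither planner is obviously more complicated to write down than the other, which is (as far as I understand it) roughly why a simplicity prior or regularizer would not by itself settle which reward is the “real” one.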
Of course this is all assuming that there does exist a true utility function, but I think we can replace “true utility function” with “utility function that encodes the optimal actions to take for the best possible universe” and everything still follows through.
The replacement feels just as obscure to me as the original.
But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.
This statement feels pretty strong, especially given that I find it trivially true that I’d be a different person under many plausible alternative histories. This makes me think I’m probably misinterpreting something. :)
At first I read your paragraph as the strong claim that if it’s true that individual human values are underdetermined at birth, then ambitious value learning looks doomed. And I’d take it as proof for “individual human values are underdetermined at birth” if, replaying history, I’d now have different values (or a different probability distribution over values) if I had encountered Yudkowsky’s writings before Singer’s, rather than vice-versa. Or if I would be less single-minded about altruism had I encountered EA a couple of years later in life, after already taking on another self-identity.
But these points (especially the second example) seem so trivially true that I’m probably talking about a different thing. In addition, they’re addressed by the solution you propose in your first paragraph, namely taking current-you as the starting point.
Another concern could be that “there is almost never a stable core of an individual human’s values”, i.e., that “even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined”. Is that the concern? This seems like it could be possible for most people, but definitely not for all people. And underdetermined values are not necessarily that bad (though I find it mildly disconcerting, personally). [Edit: Wei’s comment and your reply to it sound like this might indeed be the concern. :) Good discussion there!]
The fact that I have a hard time understanding the framework behind your statement is probably because I’m thinking in terms of a different part of my brain when I talk about “my values”. I identify very much with my reflective life goals to a point that seems unusual. I don’t identify much with “What Lukas’s behavior, if you were to put him in different environments and then watch, would indirectly consistently tell you about the things he appears to want – e.g., ‘values’ like being held in high esteem by others, having a comfortable life, romance, having either some kind of overarching purpose or enough distractions to not feel bothered by the lack of purpose, etc.”. There is definitely a sense in which the code that runs me is caring about all these implicit goals. But that’s not how I most want to see it. I also know that in all the environments that offer the options to self-modify into a more efficient pursuer of explicitly held personal ideals, I would make substantial use of the option to self-modify. And that seems relevant for the same reason that we wouldn’t want to count cognitive biases as people’s values.
(I should probably continue reading the sequence and then come back to this later if I still feel unclear about it.)
And what about the tradeoff? Is there one?
What you mention in your second possibility (“rote, robotic way”) goes in a similar direction, but I’d be worried about something more specific: Difficulties at big-picture prioritization when it comes to selecting what to be interested in. I envy people who find it easy to delve into all kinds of subjects and absorb a wealth of knowledge. But those same people may then fail to be curious enough when they encounter a piece of information that really would be much more relevant than the information they usually encounter. Or they might spend their time on tasks that don’t produce the most impact.
Admittedly I’m looking at this with a terribly utilitarianism-tainted lens. Probably finding it easy to be interested in many things is generally a huge plus.
But I do suspect that there’s a tradeoff. If reading about the Battle of Cedar Creek felt 30% as interesting to our brains as reading cognitive science or LessWrong or Peter Singer or whatever got people here hooked on these sorts of things, then maybe fewer of us would have gotten hooked.
I think I’m talking about a different concept than you are talking about. Here’s what I take to be hypocrisy that is probably/definitely bad:
When someone’s brain is really good at selective remembering and selective forgetting, remembering things so they are convenient, and forgetting things that are inconvenient. And when the person is either unconsciously or only semi-consciously acting as an amplifier of opinions, sensing where a group is likely to go and then pushing (and often overshooting) the direction in order to be first to score points. This is where flip-flopping gets its bad reputation. At the extreme the person may fail to distinguish, in terms of mental motions, between what is their actual opinion vs. what opinion they expect to earn praise.
Some of these may not always come together but I think they often do, and the common theme is self-deception and little introspection. For instance, something many people do without noticing: Everyone’s opinions fluctuate over time; sometimes you feel lukewarm about an idea, at other times you’re an ardent supporter. If it later turns out that the idea was great, you remember mostly the times you supported it. If it turns out the idea was absolutely horrible, you’re tempted to specifically remember this one 2-week window half a year before the idea fell out of fashion where you felt lukewarm about it and voiced doubts to someone (or were “almost” going to do that), and you then tell yourself and others that “you called it” even though, in reality, you totally failed to pay attention to your doubts.
Another example: You fail to understand or spot a good idea when you first hear it, then later once the context makes it more obvious that the idea was great, it occurs to you and you think it’s entirely your own idea, so much so that you’d enthusiastically tell it to the person you first heard it from. (Often this is innocent, but if it happens an uncanny number of times maybe it’s a reason to start paying attention.)
I think this type of hypocrisy hinders self-growth, can prevent the right people from getting credit, and amplifies group biases. So I’d say it’s very bad. But norms against hypocrisy have to be careful because it’s something that everyone might have to some degree, and the costs of enforcing norms need to be kept smaller than the actual problem. Keeping score or arguing over whose memory about something is right can create an atmosphere with effects just as bad as extreme hypocrisy itself. Sometimes hypocrisy is fueled by a desire to be held in high regard, and then being accused of hypocrisy may also worsen the mechanisms at work.
I believe that you’re right about the historicity, but for me at least, any explanations of UDT I came across a couple of years ago seemed too complicated for me to really grasp the implications for anthropics, and ADT (and the appendix of Brian’s article here) were the places where things first fell into place for my thinking. I still link to ADT these days as the best short explanation for reasoning about anthropics, though I think there may be better explanations of UDT now (suggestions?). Edit: I of course agree that giving credit to UDT is good practice.