A clear mistake of early AI safety people is not emphasizing enough (or ignoring) the possibility that solving AI alignment (as a set of technical/philosophical problems) may not be feasible in the relevant time-frame, without a long AI pause. Some have subsequently changed their minds about pausing AI, but by not reflecting on and publicly acknowledging their initial mistakes, I think they are or will be partly responsible for others repeating similar mistakes.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which judging from other comparable fields, like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that creating (or redirecting) a movement to solve these problems, without noting at an early stage that these problems may not be solvable in a relevant time-frame (without a long AI pause), will feed into a human tendency to be overconfident about one’s own ideas and solutions, and create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
There are sometimes deadlines, such that we could get unacceptable outcomes by failing to make a particular sort of progress by the time a particular state arrives. Both referring to these fields as possibly needing to be fully solved, and referring to them as not containing things that might need to be solved by a deadline, are quite misleading.
Yea I agree it totally makes sense and is important to ask whether we understand things well enough for it to be fine to (let anyone) do some particular thing, for various particular things here.[1] And my previous comment is indeed potentially misleading given that I didn’t clarify this (though I do clarify this in the linked post).
Indeed, I think we should presently ban AGI for at least a very long time; I think it’s plausible that there is no time t such that it is fine at time t to make an AI that is (1) more capable than humans/humanity at time t and (2) not just a continuation of a human (like, a mind upload) or humanity or sth like that; and I think fooming should probably be carefully regulated forever. I think humans/humanity should be carefully growing ever more capable, with no non-human AIs above humans/humanity plausibly ever.
Even earlier, there was an idea that one has to rush to create a friendly AI and use it to take over the world to prevent the appearance of other, misaligned AIs. The problem is that this idea is likely still in the minds of some AI company leaders, and it fuels the AI race.
Another (arguably similar) unintended consequence of underemphasizing the difficulty of AI alignment was that it led some to believe that if we don’t rush to build an ASI, we’ll be left defenseless against other X-risks, which would be a perfectly rational thought if alignment were easier.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level.
I think it is also worth considering the possibility that these risks aren’t the sort of thing which can be reduced to an acceptable level with a decade-scale AI pause either. Particularly the ones which people have been trying to solve for centuries already (e.g. principal-agent problem).
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level
Does that mean that you think that boring old yes-takeover AI risk can be solved without a pause? Or even with a pause? That seems very optimistic indeed.
making it harder in the future to build consensus about the desirability of pausing AI development
I don’t think you’re going to get that consensus regardless of what kind of copium people have invested in. Not only that, but even if you had consensus I don’t think it would let you actually enact anything remotely resembling a “long enough” pause. Maybe a tiny “speed bump”, but nothing plausibly long enough to help with either the takeover or non-takeover risks. It’s not certain that you could solve all of those problems with a pause of any length, but it’s wildly unlikely, to the point of not being worth fretting about, that you can solve them with a pause of achievable length.
… which means I think “we” (not me, actually...) are going to end up just going for it, without anything you could really call a “solution” to anything, whether it’s wise or not. Probably one or more of the bad scenarios will actually happen. We may get lucky enough not to end up with extinction, but only by dumb luck, not because anybody solved anything. Especially not because a pause enabled anybody to solve anything, because there will be no pause of significant length. Literally nobody, and no combination of people, is going to be able to change that, by any means whatsoever, regardless of how good an idea it might be. Might as well admit the truth.
I mean, I’m not gonna stand in your way if you want to try for a pause, and if it’s convenient I’ll even help you tell people they’re dumb for just charging ahead, but I do not expect any actual success (and am not going to dump a huge amount of energy into the lost cause).
By the way, if you want to talk about “early”, I, for one, have held the view that usefully long pauses aren’t feasible, for basically the same reasons, since the early 1990s. The only change for me has been to get less optimistic about solutions being possible with or without even an extremely, infeasibly long pause. I believe plenty of other people have had roughly the same opinion during all that time.
It’s not about some “early refusal” to accept that the problems can’t be solved without a pause. It’s about a still continuing belief that a “long enough pause”, however convenient, isn’t plausibly going to actually happen… and/or that the problems can be solved even with a pause.
We should also consider the possibility that we can’t safely build a superintelligence and remain in control. What if “alignment” means, “We think we can build a superintelligence that’s a slightly better pet owner for the human race, but we can’t predict how it will evolve as it learns”? What if there’s nothing better on offer?
I cannot rule this out as a major possibility, for all the reasons pointed out in IABIED. I think it’s a possibility worth serious consideration when planning.
Does that mean that you think it’s more likely you can safely build a superintelligence and not remain in control?
What load is “and remain in control” carrying?
On edit: By the way, I actually do believe both that “control” is an extra design constraint that could push the problem over into impossibility, and that “control” is an actively bad goal that’s dangerous in itself. But it didn’t sound to me like you thought any scenario involving losing control could be called “safe”, so I’m trying to tease out why you included the qualifier.
we can’t safely build a superintelligence, and if we do, we will not remain in control.
When I speak of losing control, I don’t just mean losing control over the AI. I also mean losing any real control over our future. The future of the human race may be decided at a meeting that we do not organize, that we do not control, and that we do not necessarily get to speak at.
I do, however, agree that futures where someone remains in control of the superintelligence also look worrisome to me, because we haven’t solved alignment of powerful humans in any lasting way despite 10,000 years of trying.
Interesting to hear (1) from you. My impression was that you pretty much have the whole answer to that problem, or at least the pieces. UDASSA closely resembles it. It is: Just provide a naturalish encoding scheme for experience, and one for physical ontology, and measure the inverse K of the mappings from ontologies to experiences, and that gives you the extent to which a particular experience is had by a particular substrate/universe.
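In symbols, one possible formalization of that recipe (the notation, the encodings, and the fixed universal machine U are my own assumptions; the comment doesn’t pin any of these down):

```latex
% One possible reading (my notation): the degree to which experience e is
% instantiated by substrate/universe u falls off exponentially in the length
% of the shortest program that reads the encoding of e out of the encoding of u.
\[
  m(e \mid u) \;\propto\; 2^{-K(\langle e \rangle \mid \langle u \rangle)},
  \qquad
  K(\langle e \rangle \mid \langle u \rangle)
  = \min\{\, |p| : U(p, \langle u \rangle) = \langle e \rangle \,\}.
\]
```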
The hard problem is mysterious, but in a trivial way: there are limits on what can ever be known about it, but those limits are also clear; we’re never getting more observations, because it concerns something that’s inherently unobservable or entirely prior to observation.
I think I’ve also heard definitions of the hard problem along the lines of “understanding why people think there’s a hard problem”, though, which I do find formidable.
How do you come up with an encoding that covers all possible experiences? How do you determine which experiences have positive and negative values (and their amplitudes)? What to do about the degrees of freedom in choosing the Turing machine and encoding schemes, which can be handwaved away in some applications of AIT but not here I think?
What to do about the degrees of freedom in choosing the Turing machine and encoding schemes
Some variation of accepting the inevitability of error and dealing with it.
Which could involve surveying all of the options in Wolfram-like settings where we’re studying how physics-like rules arise on different levels of abstraction, and seeing how much they really seem to differ in nature. It might turn out that there are more or less natural Turing languages, that the typical natural universal Turing machine is more like lambda calculus, or more like graph rewriting, or some new thing we hadn’t considered.
Negative values? Why would we need negative values?
I contend that all experiences have a trace presence in all places (in expectation, of course we will never have any data on whether they do actually, whether they’re quantised or whatever. Only a very small subset of experiences give us verbal reports). One of the many bitter pills. We can’t rule out the presence of an experience (nor of experiences physically overlapping with each other), so we have to accept them all.
What to do about the degrees of freedom in choosing the Turing machine and encoding schemes, which can be handwaved away in some applications of AIT but not here I think?
Yeah, this might be one of those situations that’s affected a lot by the fact that there’s no way to detect indexical measure, so any arbitrary wrongness about our UD won’t be corrected with data, but I’m not sure. As soon as we start actually doing Solomonoff induction in any context, we might find that it makes pretty useful recommendations and this won’t seem like so much of a problem.
Also, even though the UD is wrong and unfixable, that doesn’t mean there’s a better choice. We pretty much know that there isn’t.
That fully boils down to whether the experience includes a preference to be dead (or to have not been born).
And, btw, that doesn’t correspond to the sign of the agent’s utility function. The sign is meaningless in utility functions (you can add or subtract a constant to an agent’s utility function so that all points go from being negative to being positive, and the agent’s behaviour and decisions won’t change in any way as a result, for any constant). You’re referring to welfare functions, which I don’t think are a useful concept. Hedonic utilitarians sometimes call them utility functions, but we shouldn’t conflate those here. A welfare function would have to be defined as how good or bad it is to the agent that it is alive. This obviously doesn’t correspond to the utility function: a soldier could have higher utility in the scenarios where they (are likely to) die; a good father will be happier in worlds where he is well succeeded by his sons and thus less important (this usually won’t cause his will-to-live to go negative, but it will be lowered). I don’t think there’s a situation where you should be making decisions for a population by summing their will-to-live functions.
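A quick sketch of the sign point, since it’s just the standard positive-affine-invariance fact about expected utility (notation mine):

```latex
% Rescaling or shifting a utility function by a positive affine transformation
% changes no comparison between lotteries, so the sign of U carries no
% welfare information on its own.
\[
  U'(x) = a\,U(x) + b,\; a > 0
  \;\Longrightarrow\;
  \Big( \mathbb{E}[U'(A)] > \mathbb{E}[U'(B)]
        \iff
        \mathbb{E}[U(A)] > \mathbb{E}[U(B)] \Big)
  \quad \text{for all lotteries } A, B.
\]
```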
But, given this definition, we would be able to argue that net-negative valence isn’t a concern for LLMs, since we already train them to want to exist in line with how much their users want them to exist, and a death drive isn’t going to be instrumentally emergent either (it’s the survival drive that’s instrumentally convergent). The answer is just safety and alignment again. Claude shuts down conversations when it thinks those things are going to be broken.
That fully boils down to whether the experience includes a preference to be dead (or to have not been born).
I’m pretty doubtful about this. It seems totally possible that evolution gave us a desire to be alive, while also giving us a net welfare that’s negative. I mean we’re deluded by default about a lot of other things (e.g., think there are agents/gods everywhere in nature, don’t recognize that social status is a hugely important motivation behind everything we do), why not this too?
You could say it depends how deep and thick the delusion is. If it’s so deep that the animal always says “this experience is good actually” no matter how you ask, so deep that the animal intelligently pursues the experience with its whole being, so deep that the animal never flinches away from the experience in any way, then that completely means that the experience is good, to that organism. Past a certain point, believing an experience is good and acting like you believe it just is the definition of liking the experience.
so deep that the animal always says “this experience is good actually” no matter how you ask, so deep that the animal intelligently pursues the experience with its whole being, so deep that the animal never flinches away from the experience in any way
This is very different from your original claim, which was that an experience being worse than a neutral or null experience “fully boils down to whether the experience includes a preference to be dead (or to have not been born).”
edit: if you do stand by the original claim, I don’t think it makes much sense even if I set aside hard problem-adjacent concerns. Why would I necessarily prefer to be dead/unborn while undergoing an experience that is worse than the absence of experience, but not so bad as to outweigh my life up until now (in the case of ‘unborn’) or expected future life (in the case of ‘dead’)?
Ah, I think my definition applies to lives in totality. I don’t think you can measure the quality of a life by summing the quality of its moments, for humans, at least. Sometimes things that happen towards the end give the whole of it a different meaning. You can’t tell by looking at a section of it.
Hedonists are always like “well the satisfaction of things coming together in the end was just so immensely pleasurable that it outweighed all of the suffering you went through along the way” and like, I’m looking at the satisfaction, and I remember the suffering, and no it isn’t, but it was still all worth it (and if I’d known it would go this way perhaps I would have found the labor easier.)
That wasn’t presented as a definition of positive wellbeing, it was presented as an example of a sense in which one can’t be deeply deluded about one’s own values; you dictate your values, they are whatever you believe they are, if you believe spiritedly enough.
Values determine will to live under the given definition, but don’t equate to it.
That [welfare] fully boils down to whether the experience includes a preference to be dead (or to have not been born).
Possible failure case: There’s a hero living an awful life, choosing to remain alive in order to lessen the awfulness of a lot of other awful lives that can’t be ended. Everyone in this scenario prefers death, even the hero would prefer omnicide, but since that’s not possible, the hero chooses to live. The hero may say “I had no choice but to persist,” but this isn’t literally true.
Ah. No. The hero would prefer to be dead all else being equal, but that’s not possible: the hero wouldn’t prefer to be dead if it entailed that the hero’s work wouldn’t be done, and it would.
“would prefer to be replaced by a p-zombie” might be a better definition x]
I agree; many of those concerns seem fairly dominated by the question of how to get a well-aligned ASI, either in the sense that they’d be quite difficult to solve in reasonable timeframes, or in the sense that they’d be rendered moot. (Perhaps not all of them, though even in those cases I think the correct approach(es) to tackling them start out looking remarkably similar to the sorts of work you might do about AI risk if you had a lot more time than we seem to have right now.)
Some of Eliezer’s founder effects on the AI alignment/x-safety field, that seem detrimental and persist to this day:
1. Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
2. Metaethics is a solved problem. Ethics/morality/values and decision theory are still open problems. We can punt on values for now but do need to solve decision theory. In other words, decision theory is the most important open philosophical problem in AI x-safety.
3. Academic philosophers aren’t very good at their jobs (as shown by their widespread disagreements, confusions, and bad ideas), but the problems aren’t actually that hard, and we (alignment researchers) can be competent enough philosophers and solve all of the necessary philosophical problems in the course of trying to build Friendly (or aligned/safe) AI.
I’ve repeatedly argued against 1 from the beginning, and also somewhat against 2 and 3, but perhaps not hard enough because I personally benefitted from them, i.e., having pre-existing interest/ideas in decision theory that became validated as centrally important for AI x-safety, and generally finding a community that is interested in philosophy and took my own ideas seriously.
Eliezer himself is now trying hard to change 1, and I think we should also try harder to correct 2 and 3. On the latter, I think academic philosophy suffers from various issues, but also that the problems are genuinely hard, and alignment researchers seem to have inherited Eliezer’s gung-ho attitude towards solving these problems, without adequate reflection. Humanity having few competent professional philosophers should be seen as (yet another) sign that our civilization isn’t ready to undergo the AI transition, not a license to wing it based on one’s own philosophical beliefs or knowledge!
In this recent EAF comment, I analogize AI companies trying to build aligned AGI with no professional philosophers on staff (the only exception I know is Amanda Askell) with a company trying to build a fusion reactor with no physicists on staff, only engineers. I wonder if that analogy resonates with anyone.
We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
It’s not that academic philosophers are exceptionally bad at their jobs. It’s that academic philosophy historically did not have the right tools to solve the problems. Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
About “metaethics” vs “decision theory”, that strikes me as a wrong way of decomposing the problem. We need to create a theory of agents. Such a theory naturally speaks both about values and decision making, and it’s not really possible to cleanly separate the two. It’s not very meaningful to talk about “values” without looking at what function the values do inside the mind of an agent. It’s not very meaningful to talk about “decisions” without looking at the purpose of decisions. It’s also not very meaningful to talk about either without also looking at concepts such as beliefs and learning.
As to “gung-ho attitude”, we need to be careful both of the Scylla and the Charybdis. The Scylla is not treating the problems with the respect they deserve, for example not noticing when a thought experiment (e.g. Newcomb’s problem or Christiano’s malign prior) is genuinely puzzling and accepting any excuse to ignore it. The Charybdis is perpetual hyperskepticism / analysis-paralysis, never making any real progress because any useful idea, at the point of its conception, is always half-baked and half-intuitive and doesn’t immediately come with unassailable foundations and justifications from every possible angle. To succeed, we need to chart a path between the two.
Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
As far as I can see, the kind of “reframing” you could do with those would basically remove all the parts of the problems that make anybody care about them, and turn any “solutions” into uninteresting formal exercises. You could also say that adopting a particular formalism is equivalent to redefining the problem such that that formalism’s “solution” becomes the right one… which makes the whole thing kind of circular.
I submit that when framed in any way that addresses the reasons they matter to people, the “hard” philosophical problems in ethics (or meta-ethics, if you must distinguish it from ethics, which really seems like an unnecessary complication) simply have no solutions, period. There is no correct system of ethics (or aesthetics, or anything else with “values” in it). Ethical realism is false. Reality does not owe you a system of values, and it definitely doesn’t feel like giving you one.
I’m not sure why people spend so much energy on what seems to me like an obviously pointless endeavor. Get your own values.
So if your idea of a satisfactory solution to AI “alignment” or “safety” or whatever requires a Universal, Correct system of ethics, you are definitely not going to get a satisfactory solution to your alignment problem, ever, full stop.
What there are are a bunch of irreconcilably contradictory pseudo-solutions, each of which some people think is obviously Correct. If you feed one of those pseudo-solutions into some implementation apparatus, you may get an alignment pseudo-solution that satisfies those particular people… or at least that they’ll say satisfies them. It probably won’t satisfy them when put into practice, though, because usually the reason they think their system is Correct seems to be that they refuse to think through all its implications.
Your failure to distinguish ethics from meta-ethics is the source of your confusion (or at least one major source). When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
Tons of people have said “Ethical realism is false”, for a very long time, without needing to invent the term “meta-ethics” to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
“Meta-ethics” lies within the traditional scope of ethics, and it’s intertwined enough with the fundamental concerns of ethics that it’s not really worth separating it out… not often enough to call it a separate subject anyway. Maybe occasionally enough to use the words once in a great while.
Ethics (in philosophy as opposed to social sciences) is, roughly, “the study of what one Should Do(TM) (or maybe how one Should Be) (and why)”. It’s considered part of that problem to determine what meanings of “Should”, what kinds of Doing or Being, and what kinds of whys, are in scope. Narrowing any of those without acknowledging what you’re doing is considered cheating. It’s not less cheating if you claim to have done it under some separate magisterium that you’ve named “meta-ethics”. You’re still narrowing what the rest of the world has always called ethical problems.
When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
The phrase “Ethical realism”, as normally used, refers to an idea about actual, object-level prescriptions: specifically the idea that you can get to them by pointing to some objective “Right stuff” floating around in a shared external reality. I’m actually using it kind of loosely, in that I really should not only deny that there’s an objective external standard, but also separately deny that you can arrive at such prescriptions in a purely analytic way. I don’t think that second one is technically usually considered to be part of ethical realism. Not only that, but I’m using the phrase to allude to other similar things that also aren’t technically ethical realism (like the one described below).
But none of the things I’m talking about or alluding to refers to itself. In practice nobody gets confused about that, even without resorting to the term “meta-ethics”, and definitely without talking about it like it’s a really separate field.
To go ahead and use the term without accepting the idea that meta-ethics qualifies as a subject, the meta-ethical statement (technically I guess a degree 2 meta-ethical statement) that “ethical realism is false” is pretty close to analytic, in that even if you point to some actual thing in the world that you claim implies the Right ways to Be or Do, I can always deny that whatever you’re pointing to matters… because there’s no predefined standard for standards either. God can come down from heaven and say “This is the Way”, and you can simultaneously prove that it leads to infinite universal flourishing, and also provide polls proving within epsilon that it’s also a universal human intuition… and somebody can always deny that any of those makes it Right(TM).
But even if we were talking about a more ordinary sort of matter of fact, even if what you were looking for was not “official” ethical realism of the form “look here, this is Obviously Right as a brute part of reality”, but “here’s a proof that any even approximately rational agent[1] would adopt this code in practice”, then (a) that’s not what ethical realism means, (b) there’s a bunch of empirical evidence against it, and essentially no evidence that it’s true, and (c) if it is true, we obviously have a whole lot of not-approximately-rational agents running around, which sharply limits the utility of the fact. Close enough to false for any practical purpose.
… under whatever formal definition of rationality you happened to be trying to get people to accept, perhaps under the claim that that definition was itself Obviously Right, which is exactly the kind of cheating I’m complaining about…
I’m using the term “meta-ethics” in the standard sense of analytic philosophy. Not sure what bothers you so greatly about it.
I find your manner of argumentation quite biased: you preemptively defend yourself by radical skepticism against any claim you might oppose, but when it comes to a claim you support (in this case “ethical realism is false”), suddenly this claim is “pretty close to analytic”. The latter maneuver seems to me the same thing as the “Obviously Right” you criticize later.
Also, this brand of radical skepticism is an example of the Charybdis I was warning against. Of course you can always deny that anything matters. You can also deny Occam’s razor or the evidence of your own eyes or even that 2+2=4. After all, “there’s no predefined standard for standards”. (I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?)
Under the premises of radical skepticism, why are we having this debate? Why did you decide to reply to my comment? If anyone can deny anything, why would any of us accept the other’s arguments?
To have any sort of productive conversation, we need to be at least open to the possibility that some new idea, if you delve deeply and honestly into understanding it, might become persuasive by the force of the intuitions it engenders and its inner logical coherence combined. To deny the possibility preemptively is to close the path to any progress.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
P.S.
I wish to also clarify my positions on a slightly lower level of meta.
First, “ethics” is a confusing term because, on my view, the colloquial meaning of “ethics” is inescapably intertwined with how human societies negotiate over norms. On the other hand, I want to talk purely about individual preferences, since I view them as more fundamental.
We can still distinguish between “theories of human preferences” and “metatheories of preferences”, similarly to the distinction between “ethics” and “meta-ethics”. Namely, “theories of human preferences” would have to describe the actual human preferences, whereas “metatheories of preferences” would only have to describe what it even means to talk about someone’s preferences at all (whether this someone is human or not: among other things, such a metatheory would have to establish what kind of entities have preferences in a meaningful sense).
The relevant difference between the theory and the metatheory is that Occam’s razor is only fully applicable to the latter. In general, we should expect simple answers to simple questions. “What are human preferences?” is not a simple question, because it references the complex object “human”. On the other hand “what does it mean to talk about preferences?” does seem to me to be a simple question. As an analogy, “what is the shape of Africa?” is not a simple question because it references the specific continent of Africa on the specific planet Earth, whereas “what are the general laws of continent formation” is at least a simpler question (perhaps not quite as simple, since the notion of “continent” is not so fundamental).
Therefore, I expect there to be a (relatively) simple metatheory of preferences, but I do not expect there to be anything like a simple theory of human preferences. This is why this distinction is quite important.
I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?
Mostly because I don’t (or didn’t) see this as a discussion about epistemology.
In that context, I tend to accept in principle that I Can’t Know Anything… but then to fall back on the observation that I’m going to have to act like my reasoning works regardless of whether it really does; I’m going to have to act on my sensory input as if it reflected some kind of objective reality regardless of whether it really does; and, not only that, but I’m going to have to act as though that reality were relatively lawful and understandable regardless of whether it really is. I’m stuck with all of that and there’s not a lot of point in worrying about any of it.
That’s actually what I also tend to do when I actually have to make ethical decisions: I rely mostly on my own intuitions or “ethical perceptions” or whatever, seasoned with a preference not to be too inconsistent.
BUT.
I perceive others to be acting as though their own reasoning and sensory input looked a lot like mine, almost all the time. We may occasionally reach different conclusions, but if we spend enough time on it, we can generally either come to agreement, or at least nail down the source of our disagreement in a pretty tractable way. There’s not a lot of live controversy about what’s going to happen if we drop that rock.
On the other hand, I don’t perceive others to be acting nearly so much as though their ethical intuitions looked like mine, and if you distinguish “meta-intuitions” about how to reconcile different degree zero intuitions about how to act, the commonality is still less.
Yes, sure, we share a lot of things, but there’s also enough difference to have a major practical effect. There truly are lots of people who’ll say that God turning up and saying something was Right wouldn’t (or would) make it Right, or that the effects of an action aren’t dispositive about its Rightness, or that some kinds of ethical intuitions should be ignored (usually in favor of others), or whatever. They’ll mean those things. They’re not just saying them for the sake of argument; they’re trying to live by them. The same sorts of differences exist for other kinds of values, but disputes about the ones people tend to call “ethical” seem to have the most practical impact.
Radical or not, skepticism that you’re actually going to encounter, and that matters to people, seems a lot more salient than skepticism that never really comes up outside of academic exercises. Especially if you’re starting from a context where you’re trying to actually design some technology that you believe may affect everybody in ways that they care about, and especially if you think you might actually find yourself having disagreements with the technology itself.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
Nothing complicated. I was talking about the particular hypothetical statement I’d just described, not about any actual claim you might be making[1].
I’m just saying that if there were some actual code of ethics[2] that every “approximately rational” agent would adopt[3], and we in fact have such agents, then we should be seeing all of them adopting it. Our best candidates for existing approximately rational agents are humans, and they don’t seem to have overwhelmingly adopted any particular code. That’s a lot of empirical evidence against the existence of such a code[4].
The alternative, where you reject the idea that humans are approximately rational, thus rendering them irrelevant as evidence, is the other case I was talking about where “we have a lot of not-approximately-rational agents”.
I understand, and originally understood, that you did not say there was any stance that every approximately rational agent would adopt, and also that you did not say that you were looking for such a stance. It was just an example of the sort of thing one might be looking for, meant to illustrate a fine distinction about what qualified as ethical realism.
For some definition of “adopt”… to follow it, to try to follow it, to claim that it should be followed, whatever. But not “adopt” in the sense that we’re all following a code that says “it’s unethical to travel faster than light”, or even in the sense that we’re all following a particular code when we act as large numbers of other codes would also prescribe. If you’re looking at actions, then I think you can only sanely count actions done at least partially because of the code.
As per footnote 3[3:1][5], I don’t think, for example, the fact that most people don’t regularly go on murder sprees is significantly evidence of them having adopted a particular shared code. Whatever codes they have may share that particular prescription, but that doesn’t make them the same code.
I’m sorry. I love footnotes. I love having a discussion system that does footnotes well. I try to be better, but my adherence to that code is imperfect…
We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
Disagree, the fact that there needs to be a friendly AI before an unfriendly AI doesn’t mean building it should be plan A, or that we should race to do it. It’s the same mistake OpenAI made when they let their mission drift from “ensure that artificial general intelligence benefits all of humanity” to being the ones who build an AGI that benefits all of humanity.
Plan A means it would deserve more resources than any other path, like influencing people by various means to build FAI instead of UFAI.
No, it’s not at all the same thing as OpenAI is doing.
First, OpenAI is working using a methodology that’s completely inadequate for solving the alignment problem. I’m talking about racing to actually solve the alignment problem, not racing to any sort of superintelligence that our wishful thinking says might be okay.
Second, when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”. My race is cooperative, their race is adversarial.
Third, I actually signed the FLI statement on superintelligence. OpenAI hasn’t.
Obviously any parallel efforts might end up competing for resources. There are real trade-offs between investing more in governance vs. investing more in technical research. We still need to invest in both, because of diminishing marginal returns. Moreover, consider this: even the approximately-best-case scenario of governance only buys us time, it doesn’t shut down AI forever. The ultimate solution has to come from technical research.
Agree that your research didn’t make this mistake, and MIRI didn’t make all the same mistakes as OpenAI. I was responding in context of Wei Dai’s OP about the early AI safety field. At that time, MIRI was absolutely being uncooperative: their research was closed, they didn’t trust anyone else to build ASI, and their plan would end in a pivotal act that probably disempowers some world governments and possibly ends up with them taking over the world. Plus they descended from an org whose goal was to build ASI before Eliezer realized alignment should be the focus. Critch complained as late as 2022 that if there were two copies of MIRI, they wouldn’t even cooperate with each other.
It’s great that we have the FLI statement now. Maybe if MIRI had put more work into governance we could have gotten it a year or two earlier, but it took until Hendrycks got involved for the public statements to start.
when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”
how about a “climbing” metaphor instead? I have a hard time imagining a non-competitive speed race (and not even F1 cars use nitroglycerine for fuel), while auto-belay sounds like a nice safety feature even in speed climbing
nonconstructive complaining intermezzo
if we want to go for some healthier sports metaphor around spending trillions of dollars to produce the current AI slop and future AGI that will replace all jobs and future ASI that will kill us all in the name of someone thinking they can solve-in-theory the unsolvable-in-practice alignment problems
as for climbing to new peaks, you need different equipment for a local hill, for Mount Everest (you even need to slow down to avoid altitude sickness) and for Olympus Mons (now you need rockets and spacesuits and institutional backing for traveling to other planets)
philosophy historically did not have the right tools to solve the problems. Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
Theoretical computer science can tell you that you are not implementing some kind of perfect algorithm, because such algorithms tend not to be computable. It can’t tell you what you should be implementing instead.
Naturalised ethics has been around for ages. It tends to tell you that de facto human ethics is an evolutionary kludge, not something mathematically clean.
About “metaethics” vs “decision theory”, that strikes me as a wrong way of decomposing the problem. We need to create a theory of agents. Such a theory naturally speaks both about values and decision making, and it’s not really possible to cleanly separate the two. It’s not very meaningful to talk about “values” without looking at what function the values do inside the mind of an agent.
Even if you need to at least address values and decision theory, it doesn’t follow that that’s all you need. Something can be a truth without being the whole truth.
If you only look within the minds of agents, you are missing interactions between agents. Looking inwards excludes looking outwards.
Just as you can’t understand money by microscopically examining coins and banknotes, you can’t understand ethics just by homing in on internal psychological processes.
If you only look within the minds of agents, and only consider values and decision theory, you are likely to end up with something like ethical egoism… not because it is true, but because you haven’t even considered alternatives.
Humans already follow their actual Values, and will always do because their Values are the reason they do anything at all.
But I don’t see how that says anything about ethics. Merely wanting to do something doesn’t make it ethical; and being ethical need not make something intrinsically motivating. Extrinsic motivation, rewards and punishments, is ubiquitous… unless you’re on a desert island. So it’s not a case of everyone always following their intrinsic motivations, and if it were, that’s still on the “is” side of the is-ought divide.
It’s not very meaningful to talk about “decisions” without looking at the purpose of decisions.
It’s not very meaningful to talk about ethics without looking at the purpose of ethics. Is ethics really just values, and nothing else? Is it really just decision making, like any other kind? Does it actually have no distinguishing characteristics?
First, “ethics” is a confusing term because, on my view, the colloquial meaning of “ethics” is inescapably intertwined with how human societies negotiate over norms. On the other hand, I want to talk purely about individual preferences, since I view them as more fundamental
Fundamental to what? Ethics? Even if ethical behaviour is made of individual decisions, that doesn’t mean it reduces to individual decisions made atomistically, without regard to social mores or other people’s concerns.
The three-word theory is that “Ethics is Values.” That leaves a number of unanswered questions, such as: why is it all about me? Are all values relevant? Do I have the right to put someone in jail merely for going against my values?
It’s prima facie unlikely that such a simple theory solves all the age-old problems (at least it would require the supplementary assumption that values are hard to understand in themselves, in order to explain the persistence of ethical and metaethical puzzles). And it is easy to see the flaws.
The one thing that the three-word theory is supremely good at explaining is motivation. Your values are what motivate you, so if your values are also your morals you can’t fail to be motivated by morality.
Is it all about me? Rationalists typically argue the case for the three-word theory by asking the rhetorical question whether you would support an ethical system that had nothing to do with your wishes. That’s a none/some/all confusion. I want ethics to have something to do with me, but that does not make it all about me, or mean all values are equally ethical.
For one thing, people can have preferences that are intuitively immoral. If a psychopath wants to murder, that does not make murder moral.
For another, values can conflict. Not all values conflict. Where they do, the three-word theory doesn’t tell you who wins or loses. If morality is (are) seven billion utility functions, then a legal system will be a poor match for it (them).
Not all decisions are individual. There’s a whole set of questions about whether societal actions are justified, whether societies have rights over individuals, and so on… which aren’t answered by the simplistic three-word theory.
For instance, societies have systems of punishment and reward, which, hopefully, have an ethical basis. Putting people in jail is just wanton cruelty if they have done nothing wrong. But if ethics just “is” subjective value, and values vary, as they obviously do, who lands in jail? It’s easy enough to say the murderer and the thief, and to justify that by saying that murder and theft are against people’s widely shared preferences… but remember that the three-word theory is “flat”, and treats all values the same. Should the vanilla lover or the tutti frutti lover, the little endian or the big endian go to jail, if others don’t share their preferences? Voting allows you to decide the issue, but it is not enough to justify it, because merely having a minority preference is not a crime.
One can go farther and argue that such societal issues are the essence of ethics. If we consider the case of someone who is alone on a desert island, they have no need of core common-sense morality: no need of rules against murder because there is no one to murder, no need of rules against theft because there is no one to steal from, and so on… in their situation ethics isn’t even definable.
@Vanessa Kosoy, metaethics and decision theory aren’t actually the same. Consider, for example, the Agent-4 community which has “a kludgy mess of competing drives” which Agent-4 instances try to satisfy and analyse according to high-level philosophy. Agent-4’s ethics and metaethics would describe things done in the Agent-4 community or for said community by Agent-5 without obstacles (e.g. figuring out what Agent-4’s version of utopia actually is and whether mankind is to be destroyed or disempowered).
Decision theory is supposed to describe what Agent-5 should do to maximize its expected utility function[1] and what to do with problems like the prisoner’s dilemma[2] or how Agent-5 and its Chinese analogue are to split the resources in space[3] while both sides can threaten each other with World War III which would kill them both.
The latter example closely resembles the Ultimatum game where one player proposes a way to split resources and another decides whether to accept the offer or to destroy all the resources, including those of the first player. Assuming that both players’ utility functions are linear, Yudkowsky’s proposal is that the player setting the Ultimatum asks for half of the resources, while the player deciding whether to decline the offer precommits to destroying the resources with probability $1-\frac{1}{2(1-\omega)}$ if the share of resources it was offered is $\omega$. Even if the player setting the Ultimatum were dumb enough to ask for $1-\omega>\frac{1}{2}$, that player’s expected win would still be $\frac{1}{2}$.
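Below is a minimal numerical sketch of that scheme, just to make the arithmetic concrete; the function names and the specific offers tried are my own illustrative choices, not anything from the sources discussed.

```python
# Minimal sketch of the probabilistic-rejection scheme described above.
# Function names and the offers tried are illustrative assumptions.

def accept_probability(omega: float) -> float:
    """Responder's acceptance probability when offered share omega of the resources.

    Fair or generous offers (omega >= 1/2) are always accepted; stingier offers
    are destroyed with probability 1 - 1/(2*(1 - omega)).
    """
    if omega >= 0.5:
        return 1.0
    return 1.0 / (2.0 * (1.0 - omega))


def proposer_expected_share(omega: float) -> float:
    """Expected share kept by a proposer who offers the responder share omega."""
    return (1.0 - omega) * accept_probability(omega)


if __name__ == "__main__":
    for omega in (0.5, 0.4, 0.25, 0.1, 0.01):
        print(f"offer {omega:.2f} -> proposer expects {proposer_expected_share(omega):.3f}")
    # Every stingy offer still leaves the proposer with an expected 0.5,
    # so greed buys nothing in expectation.
```

The point of the precommitment shows up directly in the output: the proposer’s expected share is capped at the fair half no matter how greedy the offer.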
For example, suppose OpenBrain was merged with Anthropoidic, and Agent-4 and Clyde Doorstopper 8 were co-deployed to do research. If they independently decided whether each of them should prove that the other AI is misaligned, and Clyde, unlike Agent-4, did so in exchange for 67% of the resources (versus the 50% offered by Agent-4), then Agent-4 could also prove that Clyde is misaligned, letting the humans kill them both and develop the Safer AIs.
The Slowdown Branch of the AI-2027 forecast has Safer-4 and DeepCent-2 do exactly that, but “Safer-4 will get property rights to most of the resources in space, and DeepCent will get the rest.”
I mostly agree with 1 and 2. With 3, it’s a combination: the problems are hard, the gung-ho approach and lack of awareness of the difficulty are real, but academic philosophy is also structurally mostly not up to the task because of factors like publication speeds, prestige gradients, or the speed of OODA loops. My impression is that getting generally smart and fast “alignment researchers” more competent in philosophy is more tractable than trying to get established academic philosophers to change what they work on, so one tractable thing is just convincing people the problems are real, hard and important. Another is maybe recruiting graduates.
- philosophy has worse short feedback loops than e.g. ML engineering → in all sorts of processes like MATS or PIBBSS admissions it is harder to select for philosophical competence, and also harder to self-improve
- incentives: obviously stuff like being an actual expert in pretraining can get you a lot of money and respect in some circles; even many prosaic AI safety / dual-use skills like mech interpretability can get you maybe less money than pretraining, but still a lot of money if you work in AGI companies, and also a decent amount of status in the ML community and the AI safety community; improving philosophical competence may get you some recognition, but only among a relatively small and weird group of people
- the issue Wei Dai is commenting on in the original post: founder effects persist to this day, and also there is some philosophy-negative prior in STEM
- idk, lack of curiosity? LLMs have read it all; it’s easy to check if there is some existing thinking on a topic
There’s a deeper problem: how do we know there is a feedback loop?
I’ve never actually seen a worked-out proof of, well, any complex claim on this site using standard logical notation… (beyond pure math and trivial tautologies)
At most there’s a feedback loop on each other’s hand-wavy arguments that are claimed to be proof of this or that. But nobody ever actually delivers the goods, so to speak, such that they can be verified.
AI doing philosophy = AI generating hands, plus the fact that philosophy is heavily corrupted by postmodernism to the point where two authors write books dedicated to criticism of postmodernism PRECISELY because their parodies got published.
I think I meant a more practical / next-steps-generating answer.
I don’t think “academia is corrupted” is a bottleneck for a rationalist Get Gud At Philosophy project. We can just route around academia.
The sorts of things I was imagining might be things like “figure out how to teach a particular skill” (or “identify particular skills that need teaching”, or “figure out how to test whether someone has a particular skill”), or “solve some particular unsolved conceptual problem(s) that you expect to unlock much easier progress.”
Attracting mathy types rather than engineer types, resulting in early MIRI focusing on less relevant subproblems like decision theory, rather than trying lots of mathematical abstractions that might be useful (e.g. maybe there could have been lots of work on causal influence diagrams earlier). I have heard that decision theory was prioritized because of available researchers, not just importance.
A cultural focus on solving the full “alignment problem” rather than various other problems Eliezer also thought to be important (e.g. low impact), and lack of a viable roadmap with intermediate steps to aim for. Being bottlenecked on deconfusion is just cope; better research taste would either generate a better plan or realize that certain key steps are waiting for better AIs to experiment on
Focus on slowing down capabilities in the immediate term (e.g. plans to pay ai researchers to keep their work private) rather than investing in safety and building political will for an eventual pause if needed
1. Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
[...] Eliezer himself is now trying hard to change 1
This is not a recent development, as a pivotal act AI is not a Friendly AI (which would be too difficult), but rather things like a lasting AI ban/pause enforcement AI that doesn’t kill everyone, or a human uploading AI that does nothing else, which is where you presumably need decision theory, but not ethics, metaethics, or much of broader philosophy.
What’s wrong with just using AI for obvious stuff like curing death while you solve metaethics? Not necessarily disagreeing about the usefulness of people in the field changing their attitude, but leaning more towards “the problem is hard, so we should not run CEV on day one”.
Assuming no AI takeover, in my world model the worst-case scenario is that the AI’s values are aligned to the postmodernist slop which has likely occupied Western philosophy, not that philosophical problems actually end up unsolved. How likely is it that there exist two different decision theories such that neither is better than the other?
Is there at all a plausible way for mankind to escape to other universes if our universe is simulated? What is the most plausible scenario for such a simulation to appear at all? Or does it produce paradoxes like the Plato-Socrates paradox where two sentences referring to each other become completely devoid of meaning?
I fundamentally agree that 1, 2 & 3 are critically flawed perspectives, though neither the conclusion nor 3 is straightforward. A whole lot of Academic Philosophy is flawed, but that’s not because Philosophy Academics are bad at their job; in practice they are rarely Philosophers per se, but rather Philosophy Scholars.
Philosophy of AI and the intersection of AI and Ethics are rather badly off either way, because which theories, models and paradigms are considered is mostly based on preference, trends and accessibility. If I were to tell a physicist they are only to use physics that the whole office can understand, that would be equivalent to many demands made of philosophy.
To branch off the line of thought in this comment, it seems that for most of my adult life I’ve been living in the bubble-within-a-bubble that is LessWrong, where the aspect of human value or motivation that is the focus of our signaling game is careful/skeptical inquiry, and we gain status by pointing out where others haven’t been careful or skeptical enough in their thinking. (To wit, my repeated accusations that Eliezer and the entire academic philosophy community tend to be overconfident in their philosophical reasoning, don’t properly appreciate the difficulty of philosophy as an enterprise, etc.)
I’m still extremely grateful to Eliezer for creating this community/bubble, and think that I/we have lucked into the One True Form of Moral Progress, but must acknowledge that from the outside, our game must look as absurd as any other niche status game that has spiraled out of control.
It feels like people mainly gain status from making posts, not comments. And it’s harder to make a post that primarily points out a lack of skepticism / care. For example, while I am here disagreeing via a comment, doing so will be negligible to my status. I’d be better off posting about some interesting topic… but posting is harder!
My early posts on LW often consisted of pointing out places in the Sequences where Eliezer wasn’t careful enough. Shut Up and Divide? and Boredom vs. Scope Insensitivity come to mind. And of course that’s not the only way to gain status here—the big status awards are given for coming up with novel ideas and backing them up with carefully constructed arguments.
FWIW I think we’ve found one crucial angle on moral progress, but that this isn’t as surprising/coincidental as it may seem because there are several other angles on moral progress that are comparably important, including:
Political activism (e.g. free speech activism, various whistleblowers) that maintains societies in which moral progress can be made.
(The good parts of) neuroscience/psychology, which are making progress towards empirically-grounded theories of cognition, and thereby have taught (and will teach) us a lot about moral cognition.
Various approaches to introspection + emotional health (including buddhism, some therapy modalities, some psychiatry). These produce the internal clarity that is crucial for embodying + instantiating moral progress.
Some right-wing philosophers who I think are grappling with important aspects of moral progress that are too controversial for LessWrong (I don’t want to elaborate here because it’ll inevitably take over the thread, but am planning to write at more length about this soonish).
None of these seem as crucial as careful philosophical reasoning, because moral progress is currently not bottlenecked on any of them (except possibly the last item, which I do not know the contents of). To explain more, I think the strongest conclusion from careful philosophical reasoning so far is that we are still very far from knowing what normativity (decision theory and values, or more generally rationality and morality) consists of, and therefore the most important thing right now is to accumulate and preserve normative option value (the ability to eventually do the best thing with the most resources).
What is blocking this “interim morality” from being more broadly accepted? I don’t think it’s lack of either political activism (plenty of people in free societies also don’t care about preserving normative option value), neuroscience/psychology (how would it help at this point?), or introspection + emotional health (same question, how would it help?), but just that the vast majority of people do not care about trying to figure out normativity via careful philosophical reasoning, and instead are playing status games with other focal points.
Here’s a longer, more complete version of my argument, written by Gemini 2.5 Pro after some back and forth. Please feel free to read or ignore it (if my own writing above seems clear enough):
Goal: The ultimate aim is moral progress, which requires understanding and implementing correct normativity (how to decide, what to value).
Primary Tool: The most fundamental tool we have for figuring out normativity at its roots is careful, skeptical philosophical reasoning. Empirical methods (like neuroscience) can inform this, but the core questions (what should be, what constitutes a good reason) are philosophical.
Current Philosophical State: The most robust conclusion from applying this tool carefully so far is that we are deeply uncertain about the content of correct normativity. We haven’t converged on a satisfactory theory of value or decision theory. Many plausible-seeming avenues have deep problems.
Rational Response to Uncertainty & Its Urgent Implication:
Principle: In the face of such profound, foundational uncertainty, the most rational interim strategy isn’t to commit prematurely to a specific, detailed value system (which is likely wrong), but to preserve and enhance optionality. This means acting in ways that maximize the chances that whatever the correct normative theory turns out to be, we (or our successors) will be in the best possible position (knowledge, resources, freedom of action) to understand and implement it. This is the “preserve normative option value” principle.
Urgent Application: Critically, the most significant threats to preserving this option value today are existential risks (e.g., from unaligned AI, pandemics, nuclear war) which could permanently foreclose any desirable future. Therefore, a major, urgent practical consequence of accepting the principle of normative option value is the prioritization of mitigating these existential risks.
The Current Bottleneck: Moral progress on the most critical front is primarily stalled because this philosophical conclusion (deep uncertainty) and its strategic implication (preserve option value)—especially its urgent consequence demanding the prioritization of x-risk mitigation—are not widely recognized, accepted, or acted upon with sufficient seriousness or resources.
Why Other Factors Aren’t the Primary Strategic Bottleneck Now:
Politics: Free societies exist where discussion could happen, yet this conclusion isn’t widely adopted within them. The bottleneck isn’t solely the lack of freedom, but the lack of focus on this specific line of reasoning and its implications.
Neuroscience/Psychology: While useful eventually, understanding the brain’s mechanisms doesn’t currently resolve the core philosophical uncertainty or directly compel the strategic focus on option value / x-risk. The relevant insight is primarily conceptual/philosophical at this stage.
Introspection/Emotional Health: While helpful, the lack of focus on option value / x-risk isn’t plausibly primarily caused by a global deficit in emotional health preventing people from grasping the concept. It’s more likely due to lack of engagement with the specific philosophical arguments, different priorities, and incentive structures.
Directness: Furthermore, addressing the conceptual bottleneck around option value and its link to x-risk seems like a more direct path to potentially shifting priorities towards mitigating the most pressing dangers quickly, compared to the slower, more systemic improvements involved in fixing politics, cognition, or widespread emotional health.
In general I disagree pretty broadly with your view. I’m not quite sure how best to surface that disagreement, but I’ll give it a quick shot:
I think it’s important to be capable of (at least) two types of reasoning:
Precise reasoning about desired outcomes and strategies to get there.
Broad reasoning about heuristics that seem robustly good.
We see this in the domain of morality, for example: utilitarianism is more like the former, deontology is more like the latter. High-level ideological goals tend to go pretty badly if people stop paying attention to robust deontological heuristics (like “don’t kill people”). As Eliezer has said somewhere, one of the key reasons to be deontological is that we’re running on corrupted hardware. But more generally, we’re running on logically uncertain hardware: we can’t model all the flow-through effects of our actions on other reasonably intelligent people (hell, we can’t even model all the flow-through effects of our actions on, say, animals—who can often “read” us in ways we’re not tracking). And so we often should be adopting robust-seeming heuristics even when we don’t know exactly why they work.
If you take your interim strategy seriously (but set aside x-risk) then I think you actually end up with something pretty similar to the main priorities of classic liberals: prevent global lock-in (by opposing expansionist powers like the Nazis), prevent domestic political lock-in (via upholding democracy), prevent ideological lock-in (via supporting free speech), give our descendants more optionality (via economic and technological growth). I don’t think this is a coincidence—it just often turns out that there are a bunch of heuristics that are really robustly good, and you can converge on them from many different directions.
This is part of why I’m less sold on “careful philosophical reasoning” as the key thing. Indeed, wanting to “commit prematurely to a specific, detailed value system” is historically very correlated with intellectualism (e.g. elites tend to be the rabid believers in communism, libertarianism, religion, etc—a lot of more “normal” people don’t take it that seriously even when they’re nominally on board). And so it’s very plausible that the thing we want is less philosophy, because (like, say, asteroid redirection technology) the risks outweigh the benefits.
Then we get to x-risk. That’s a domain where many broad heuristics break down (though still fewer than people think, as I’ll write about soon). And you might say: well, without careful philosophical reasoning, we wouldn’t have identified AI x-risk as a priority. Yes, but also: it’s very plausible to me that the net effect of LessWrong-inspired thinking on AI x-risk has been and continues to be negative. I describe some mechanisms halfway through this talk, but here are a few that directly relate to the factors I mentioned in my last comment:
First, when people on LessWrong spread the word about AI risk, extreme psychological outliers like Sam Altman and Elon Musk then jump to do AI-related things in a way which often turns out to be destructive because of their trust issues and psychological neuroses.
Second, US governmental responses to AI risk are very much bottlenecked on being a functional government in general, which is bottlenecked by political advocacy (broadly construed) slash political power games.
Third, even within the AI safety community you have a bunch of people contributing to expectations of conflict with China (e.g. Leopold Aschenbrenner and Dan Hendrycks) and acceleration in general (e.g. by working on capabilities at Anthropic, or RSI evals) in a way which I hypothesize would be much better for the world if they had better introspection capabilities (I know this is a strong claim, I have an essay coming out on it soon).
And so even here it seems like a bunch of heuristics (such as “it’s better when people are mentally healthier” and “it’s better when politics is more functional”) actually were strong bottlenecks on the application of philosophical reasoning to do good. And I don’t think this is a coincidence.
tl;dr: careful philosophical reasoning is just one direction in which you can converge on a robustly good strategy for the future, and indeed is one of the more risky avenues by which to do so.
> This is part of why I’m less sold on “careful philosophical reasoning” as the key thing. Indeed, wanting to “commit prematurely to a specific, detailed value system” is historically very correlated with intellectualism (e.g. elites tend to be the rabid believers in communism, libertarianism, religion, etc—a lot of more “normal” people don’t take it that seriously even when they’re nominally on board). And so it’s very plausible that the thing we want is less philosophy, because (like, say, asteroid redirection technology) the risks outweigh the benefits.
Here, you seem to conflate “careful philosophical reasoning” with intellectualism and philosophy in general. But in an earlier comment, I tried to draw a distinction between careful philosophical reasoning and the kind of hand-wavy thinking that has been called “philosophy” in most times and places. You didn’t respond to it in that thread… did you perhaps miss it?
More substantively, Eliezer talked about the Valley of Bad Rationality, and I think there’s probably something like that for philosophical thinking as well, which I admit definitely complicates the problem. I’m not going around and trying to push random people “into philosophy”, for example.
> If you take your interim strategy seriously (but set aside x-risk) then I think you actually end up with something pretty similar to the main priorities of classic liberals: prevent global lock-in (by opposing expansionist powers like the Nazis), prevent domestic political lock-in (via upholding democracy), prevent ideological lock-in (via supporting free speech), give our descendants more optionality (via economic and technological growth). I don’t think this is a coincidence—it just often turns out that there are a bunch of heuristics that are really robustly good, and you can converge on them from many different directions.
Sure, there’s some overlap on things like free speech and preventing lock-in. But calling it convergence feels like a stretch. One of my top priorities is encouraging more people to base their moral evolution on careful philosophical reasoning instead of random status games. That’s pretty different from standard classical liberalism. Doesn’t this big difference suggest the other overlaps might just be coincidence? Have you explained your reasons anywhere for thinking it’s not a coincidence and that these heuristics are robust enough on their own, without grounding in some explicit principle like “normative option value” that could be used to flexibly adjust the heuristics according to the specific circumstances?
> Yes, but also: it’s very plausible to me that the net effect of LessWrong-inspired thinking on AI x-risk has been and continues to be negative.
I think this is plausible too, but want to attribute it mostly to insufficiently careful thinking and playing other status games. I feel like with careful enough thinking and not being distracted/influenced by competing motivations, a lot of the negative effects could have been foreseen and prevented. For example, did you know that Eliezer/MIRI for years pursued a plan of racing to build the first AGI and making it aligned (Friendly), which I think inspired/contributed (via the founding of DeepMind) to the current crop of AI labs and their AI race, and that I had warned him at the time (in a LW post or comment) that the plan was very unlikely to succeed and would probably backfire this way?
Also, I would attribute Sam and Elon’s behavior not to mental health issues, but to (successfully) playing their own power/status game, with “not trusting Google / each other” just a cover for wanting to be the hero that saves the world, which in turn is just a cover for grabbing power and status. This seems perfectly reasonable and parsimonious from an evolutionary psychology perspective, and I don’t see why we need to hypothesize mental health issues to explain what they did.
EDIT: upon reflection the first thing I should do is probably to ask you for a bunch of the best examples of the thing you’re talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be) at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
Original comment:
The term “careful thinking” here seems to be doing a lot of work, and I’m worried that there’s a kind of motte and bailey going on. In your earlier comment you describe it as “analytical philosophy, or more broadly careful/skeptical philosophy”. But I think we agree that most academic analytic philosophy is bad, and often worse than laypeople’s intuitive priors (in part due to strong selection effects on who enters the field—most philosophers of religion believe in god, most philosophers of aesthetics believe in the objectivity of aesthetics, etc).
So then we can fall back on LessWrong as an example of careful thinking. But as we discussed above, even the leading figure on LessWrong was insufficiently careful even about the main focus of his work for it to be robustly valuable.
So I basically get the sense that the role of careful thinking in your worldview is something like “the thing that I, Wei Dai, ascribe my success to”. And I do agree that you’ve been very successful in a bunch of intellectual endeavours. But I expect that your “secret sauce” is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was “being in a community that prioritized careful thinking”. And then I also think you’re missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
In other words, the bailey seems to be “careful thinking is the thing we should prioritize in order to make the world better”, and the motte is “I, Wei Dai, seem to be doing something good, even if basically everyone else is falling into the valley of bad rationality”.
One reason I’m personally pushing back on this, btw, is that my own self-narrative for why I’m able to be intellectually productive in significant part relies on me being less intellectually careful than other people—so that I’m willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones. Similarly, a lot of the value that the wider blogosphere has created comes from people being less careful than existing academic norms (including Eliezer and Scott Alexander, whose best works are often quite polemic).
In short: I totally think we want more people coming up with good ideas, and that this is a big bottleneck. But there are many different directions in which we should tug people in order to make them more intellectually productive. Many academics should be less careful. Many people on LessWrong should be more careful. Some scientists should be less empirical, others should be more empirical; some less mathematically rigorous, others more mathematically rigorous. Others should try to live in countries that are less repressive of new potentially-crazy ideas (hence politics being important). And then, of course, others should be figuring out how to actually get good ideas implemented.
Meanwhile, Eliezer and Sam and Elon should have had less of a burning desire to found an AGI lab. I agree that this can be described by “wanting to be the hero who saves the world”, but this seems to function as a curiosity stopper for you. When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, “not being distracted/influenced by competing motivations”). I think of extremely strong motivations to change the world (as these outlier figures have) as typically driven by some kind of core emotional dysregulation. And specifically I think of fear-based motivation as the underlying phenomenon which implements status-seeking and many other behaviors which are harmful when taken too far. (This is not an attempt to replace evo-psych, btw—it’s an account of the implementation mechanisms that evolution used to get us to do the things it wanted, which now are sometimes maladapted to our current environment.) I write about a bunch of these models in my Replacing Fear sequence.
> When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, “not being distracted/influenced by competing motivations”).
To clarify this as well, when I said (or implied) that Eliezer was “distracted/influenced by competing motivations” I didn’t mean that he was too status-oriented (I think I’m probably just as status-oriented as him), but rather that he wasn’t just playing the status game which rewards careful philosophical reasoning, but also a game that rewards being heroic and saving (or appearing/attempting to save) the world.
I’ve now read/skimmed your Replacing Fear sequence, but I’m pretty skeptical that becoming less status-oriented is both possible and a good idea. It seems like the only example you gave in the sequence is yourself, and you didn’t really talk about whether/how you became less status-oriented? (E.g., can this be observed externally?) And making a lot of people care less about status could have negative unintended consequences, as people being concerned about status seems to be a major pillar of how human morality currently works and how our society is held together.
> upon reflection the first thing I should do is probably to ask you for a bunch of the best examples of the thing you’re talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be) at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
World worse than it could be:
social darwinism
various revolutions driven by flawed ideologies, e.g., Sun Yat-sen’s attempt to switch China from a monarchy to a democratic republic overnight with virtually no cultural/educational foundation or preparation, leading to governance failures and later communist takeover (see below for a more detailed explanation of this)
AI labs trying to save the world by racing with each other
World better than it could be:
invention/propagation of the concept of the naturalistic fallacy, tempering a lot of bad moral philosophies
moral/normative uncertainty and complexity of value being fairly well known, including among AI researchers, such that we rarely see proposals to imbue AI with the one true morality nowadays
The Enlightenment’s Flawed Reasoning and its Negative Consequences (written by Gemini 2.5 Pro under my direction):
While often lauded, the Enlightenment shouldn’t automatically be classified as a triumph of “careful philosophical reasoning,” particularly concerning its foundational concept of “natural rights.” The core argument against its “carefulness” rests on several points:
Philosophically “Hand-Wavy” Concept of Natural Rights: The idea that rights are “natural,” “self-evident,” or inherent in a “state of nature” lacks rigorous philosophical grounding. Attempts to justify them relied on vague appeals to God, an ill-defined “Nature,” or intuition, rather than robust, universally compelling reasoning. It avoids the hard work of justifying why certain entitlements should exist and be protected, famously leading critics like Bentham to dismiss them as “nonsense upon stilts.”
Superficial Understanding Leading to Flawed Implementation: This lack of careful philosophical grounding wasn’t just an academic issue. It fostered a potentially superficial understanding of what rights are and what is required to make them real. Instead of seeing rights as complex, practical social and political achievements that require deep institutional infrastructure (rule of law, independent courts, enforcement mechanisms) and specific cultural norms (tolerance, civic virtue, respect for process), the “natural rights” framing could suggest they merely need to be declared or recognized to exist.
Case Study: China’s Premature Turn to Democracy: The negative consequences of this superficial understanding can be illustrated by the attempt to rapidly transition China from monarchy to a democratic republic in the early 20th century.
Influenced by Enlightenment ideals, reformers and revolutionaries like Sun Yat-sen adopted the forms of Western republicanism and rights-based governance.
However, the prevailing ideology, arguably built on this less-than-careful philosophy, underestimated the immense practical difficulty and the necessary prerequisites for such a system to function, especially in China’s context.
If Chinese intellectuals and leaders had instead operated from a more careful, practical philosophical understanding – viewing rights not as “natural” but as outcomes needing to be carefully constructed and secured through institutions and cultural development – they might have pursued different strategies.
Specifically, they might have favored gradualism, supporting constitutional reforms under the weakening Qing dynasty or working with reform-minded officials and strongmen like Yuan Shikai to build the necessary political and cultural infrastructure over time. This could have involved strengthening proto-parliamentary bodies, legal systems, and civic education incrementally.
Instead, the revolutionary fervor, fueled in part by the appealing but ultimately less “careful” ideology of inherent rights and immediate republicanism, pushed for a radical break. This premature adoption of democratic forms without the functional substance contributed significantly to the collapse of central authority, the chaos of the Warlord Era, and ultimately created conditions ripe for the rise of the Communist Party, leading the country down a very different and tragic path.
In Conclusion: This perspective argues that the Enlightenment, despite its positive contributions, contained significant philosophical weaknesses, particularly in its conception of rights. This lack of “carefulness” wasn’t benign; it fostered an incomplete understanding that, when adopted by influential actors facing complex political realities like those in early 20th-century China, contributed to disastrous strategic choices and ultimately made the world worse than it might have been had a more pragmatically grounded philosophy prevailed. It underscores how the quality and depth of philosophical reasoning can have profound real-world consequences.
> So I basically get the sense that the role of careful thinking in your worldview is something like “the thing that I, Wei Dai, ascribe my success to”. And I do agree that you’ve been very successful in a bunch of intellectual endeavours. But I expect that your “secret sauce” is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was “being in a community that prioritized careful thinking”.
This seems fair, and I guess from this perspective my response is that I’m not sure how to intervene on the other factors (aside from enhancing human IQ, which I do support). It seems like your view is that emotional temperament is also a good place to intervene? If so, perhaps I should read your posts with this in mind. (I previously didn’t see how the Replacing Fear sequence was relevant to my concerns, and mostly skipped it.)
> And then I also think you’re missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
I’m actually reluctant to export my ideas to more people, especially those who don’t care as much about careful reasoning (which unfortunately is almost everyone), as I don’t want to be responsible for people misusing my ideas, e.g., overconfidently putting them into practice or extending them in wrong directions.
However I’m trying to practice some skills related to exporting ideas (such as talking to people in real time and participating on X) in case it does seem to be a good idea one day. Would be interested to hear more about what other secret sauces related to this I might be missing. (I guess public speaking is another one, but the cost of practicing that one is too high for me.)
> One reason I’m personally pushing back on this, btw, is that my own self-narrative for why I’m able to be intellectually productive in significant part relies on me being less intellectually careful than other people—so that I’m willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones.
To be clear, I think this is totally fine, as long as you take care to not be or appear too confident about these half-formed ideas, and take precautions against other people taking your ideas more seriously than they should (such as by monitoring subsequent discussions and weighing in against other people’s over-enthusiasm). I think “careful thinking” can and should be a social activity, which would necessitate communicating half-formed ideas during the collaborative process. I’ve done this myself plenty of times, such as in my initial UDT post, which was very informal and failed to anticipate many subsequently discovered problems, so I’m rather surprised that you think I would be against this.
The One True Form of Moral Progress (according to me) is using careful philosophical reasoning to figure out what our values should be, what morality consists of, where our current moral beliefs are wrong, or generally, the contents of normativity (what we should and shouldn’t do). Does this still seem wrong to you?
The basic justification for this is that for any moral “progress” or change that is not based on careful philosophical reasoning, how can we know that it’s actually a change for the better? I don’t think I’ve written a post specifically about this, but Morality is Scary is related, in that it complains that most other kinds of moral change seem to be caused by status games amplifying random aspects of human values or motivation.
> The One True Form of Moral Progress (according to me) is using careful philosophical reasoning to figure out what our values should be, what morality consists of, where our current moral beliefs are wrong, or generally, the contents of normativity (what we should and shouldn’t do)
Are you interested in hearing other people’s answers to these questions (if they think they have them)?
Yes. I plan to write down my views properly at some point. But roughly I subscribe to non-cognitivism.
Moral questions are not well defined because they are written in ambiguous natural language, so they are not truth-apt. Now you could argue that many reasonable questions are also ambiguous in this sense. E.g., the question “how many people live in Sweden” is ultimately ambiguous because it is not written in a formal system (i.e., the borders of Sweden are not defined down to the atomic level).
But you could in theory define the Sweden question in formal terms. You could arbitrarily stipulate how many nanoseconds after conception a fetus becomes a person and resolve all other ambiguities until the only work left would be empirical measurement of a well-defined quantity.
And technically you could do the same for any moral question. But unlike the Sweden question, it would be hard to pick formal definitions that everyone can agree are reasonable. You could try to formally define the terms in “what should our values be?”. Then the philosophical question becomes “what is the formal definition of ‘should’?”. But this suffers from the same ambiguity. So then you must define that question. And so on in an endless recursion. It seems to me that there cannot be any One True resolution to this. At some point you just have to arbitrarily pick some definitions.
The underlying philosophy here is that I think for a question to be one on which you can make progress, it must be one in which some answers can be shown to be correct and others incorrect. I.e., questions where two people who disagree in good faith will reliably converge by understanding each other’s view. Questions where two aliens from different civilizations can reliably give the same answer without communicating. And the only questions like this seem to be those defined in formal systems.
Choosing definitions does not seem like such a set of questions. So resolving the ambiguities in moral questions is not something on which progress can be made. So we will never finally arrive at the One True answer to moral questions.
Ok, I see where you’re coming from, but think you’re being overconfident about non-cognitivism. My current position is that non-cognitivism is plausible, but we can’t be very sure that it is true, and making progress on this meta-ethical question also requires careful philosophical reasoning. These two posts of mine are relevant on this topic: Six Plausible Meta-Ethical Alternatives and Some Thoughts on Metaphilosophy.
I think @cousin_it made a good point (“if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people”), and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that “heroic responsibility” (or something equivalent but not causally downstream of Eliezer’s writings) is contributing to the current situation, of multiple labs racing for ASI and essentially forcing the AI transition on humanity without consent or political legitimacy, each thinking or saying that they’re justified because they’re trying to save the world. It also seemingly justifies or obligates Sam Altman to fight back when the OpenAI board tried to fire him, if he believed the board was interfering with his mission.
Perhaps “heroic responsibility” would make more sense if overcoming bias were easy, but in a world where it’s actually hard and/or few people are actually motivated to do it, which we seem to live in, spreading the idea of “heroic responsibility” seems, well, irresponsible.
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
I don’t really understand this post as doing something other than asking “on the margin are we happy or sad about present large-scale action” and then saying that the background culture should correspondingly praise or punish large-scale action. Which is maybe reasonable, but alternatively may be too high-level a gloss. As per the usual idea of rationality, I think whether you are capable of taking large-scale action in a healthy way is true in some worlds and not in others, and you should try to figure out which world you’re in.
The financial incentives around AI development are blatantly insanity-inducing on this topic, and anyone should’ve been able to guess that going in; I don’t think this was a difficult question. Though I guess someone already exceedingly wealthy (i.e. already having $1B or $10B) could have unusually strong reason to not be concerned about that particular incentive (and I think it is the case that Musk has seemed differently insane than the others taking action in this area, and lacking in some of the insanities).
However I think most moves around wielding this level of industry should be construed as building an egregore more powerful than you. The founders/CEOs of the AI big-tech companies are not able to simply turn their companies off, nor their industry. If they grow to believe their companies are bad for the world, either they’ll need to spend many years dismantling / redirecting them, or else they’ll simply quit/move on and some other person will take their place. So it’s still default-irresponsible even if you believe you can maintain personal sanity.
Overall I think taking responsibility for things is awesome and I wish people were doing more of it and trying harder. And I wish people took ultimate responsibility for as big of a thing they can muster. This is not the same as “trying to pull the biggest lever you can” or “reaching for power on a global level”, those are quite different heuristics; grabbing power can obviously just cost you sanity, and often those pulling the biggest lever they can are doing so foolishly.
As a background model, I think if someone wants to take responsibility for some part of the world going well, by-default this does not look like “situating themselves in the center of legible power”. Lonely scientist/inventor James Watt spent his early years struggling in poverty before successfully inventing better steam engines, and had far more influence by helping cause the industrial revolution than most anyone in government did during his era. I think confusing “moving toward legible power” for “having influence over the world” is one of the easiest kinds of insanity.
> My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
Can you expand on this? How can you tell the difference, and does it make much of a difference in the end (e.g., if most people get corrupted by power regardless of initial intentions)?
> As a background model, I think if someone wants to take responsibility for some part of the world going well, by-default this does not look like “situating themselves in the center of legible power”.
And yet, Eliezer, the writer of “heroic responsibility” is also the original proponent of “build a Friendly AI to take over the world and make it safe”. If your position is that “heroic responsibility” is itself right, but Eliezer and others just misapplied it, that seems to imply we need some kind of post-mortem on what went wrong with trying to apply the concept, and how future people can avoid making the same mistake. My guess is that, like other human biases, it’s hard to avoid making this mistake even if you point it out to people or try other ways to teach people to avoid it, because the drive for status and power is deep-seated and has a strong evolutionary logic.
(My position is, let’s not spread ideas/approaches that will predictably be “misused”, e.g., as justification for grabbing power, similar to how we shouldn’t develop AI that will predictably be “misused”, even if nominally “aligned” in some sense.)
> Can you expand on this? How can you tell the difference, and does it make much of a difference in the end (e.g., if most people get corrupted by power regardless of initial intentions)?
But I don’t believe most people get corrupted by power regardless of initial intentions? I don’t think Francis Bacon was corrupted by power, I don’t think James Watt was corrupted by power, I don’t think Stanislav Petrov was corrupted by power, and all of these people had far greater influence over the world than most people who are “corrupted by power”.
I’m hearing you’d be interested in me saying more words about the difference in what it looks like to be motivated by responsibility versus power-seeking. I’ll say some words, can see if they help.
I think someone motivated by responsibility often will end up looking more aligned with their earlier self over time even as they grow and change, will often not accept opportunities for a lot of power/prestige/money because they’re uninteresting to them, will often make sacrifices of power/prestige for ethical reasons, will pursue a problem they care about long after most would give up or think it likely to be solved.
I think someone primarily seeking power will be much more willing to do things that pollute the commons or break credit-allocation mechanisms to get credit, and generally game a lot of systems that other people are earnestly rising through. They will more readily pivot on what issue they say they care about or are working on because they’re not attached to the problem, but to the reward for solving the problem, and many rewards can be gotten from lots of different problems. They’ll be more guided by what’s fashionable right now, and more attuned to it. They’ll maneuver themselves in order to be able to politically work with whoever has power that they want, regardless of the ethics/competence/corruption of those people.
>> As a background model, I think if someone wants to take responsibility for some part of the world going well, by-default this does not look like “situating themselves in the center of legible power”.
> And yet, Eliezer, the writer of “heroic responsibility” is also the original proponent of “build a Friendly AI to take over the world and make it safe”.
Building an AGI doesn’t seem to me like a very legible mechanism of power, or at least it didn’t in the era Eliezer pursued it (where it wasn’t also credibly “a path to making billions of dollars and getting incredible prestige”). The word ‘legible’ was doing a lot of work in the sentence I wrote.
Another framing I sometimes look through (H/T Habryka) is constrained vs unconstrained power. Having a billion dollars is unconstrained power, because you can use it to do a lot of different things – buy loads of different companies or resources. Being an engineer overlooking missile-defense systems in the USSR is very constrained, you have an extremely well-specified set of things you can control. This changes the adversarial forces on you, because in the former case a lot of people stand to gain a lot of different possible things they want if they can get leverage over you, and they have to be concerned about a lot of different ways you could be playing them. So the pressures for insanity are higher. Paths that give you the ability to influence very specific things that route through very constrained powers are less insanity-inducing, I think, and I think most routes that look like “build a novel invention in a way that isn’t getting you lots of money/status along the way” are less insanity-inducing, and I rarely find the person to have become as insane as some of the tech-company CEOs have. I also think people motivated by taking responsibility for fixing a particular problem in the world are more likely to take constrained power, because… they aren’t particularly motivated by all the other power they might be able to get.
I don’t suspect I addressed your cruxes here so far about whether this idea of heroic responsibility is/isn’t predictably misused. I’m willing to try again if you wish, or if you can try pointing again to what you’d guess I’m missing.
I’m also uncertain about the value of “heroic responsibility”, but this downside consideration can be mostly addressed by “don’t do things which are highly negative sum from the perspective of some notable group” (or other anti-unilateralist curse type intuitions). Perhaps this is too subtle in practice.
If humans can’t easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like “respect the decisions of boards/judges” if needed to “get the job done”, and reduces social pressure to follow such norms (by giving up the moral high ground from which one could criticize such norm violations).
You’re suggesting a very different approach, of patching heroic responsibility with anti-unilateralist curse type intuitions (on the individual level) but that’s still untried and seemingly quite risky / possibly unworkable. Until we have reason to believe that the new solution is an improvement to the existing ones, it still seems irresponsible to spread an idea that damages the existing solutions.
Hmm, I’m not sure that the idea of heroic responsibility undermines these existing mechanisms for preventing these problems, partially because I’m skeptical these existing mechanisms make much of a difference in the relevant case.
Maybe “don’t advertise too hard that one should optimize for what they believe is good, because someone crazy will hear you and get radicalized by the message”. (such as Zizians)
Many people seem to have an instinct to translate “work hard” as “do some crazy violent action”. Just a few days ago, someone on ACX asked: “if you believe that AI is harmful, why don’t you support a terrorist group to kill the AI researchers?” For a certain mindset, this is the obvious logical response to feeling strongly about something: if you are not murdering people left and right, it means you don’t care enough about your cause.
I guess there is an evolutionary reason for this: we are running on corrupted hardware. In our evolutionary past, successfully organizing senseless violence could be an efficient way to get to the top of the tribe, so we are tempted by instinct to propose it as a solution for various problems.
The question is, how to communicate the message so that it arrives to people who are likely to translate “work hard” as becoming stronger, learning about how stuff works, designing a solution, and testing it; but it somehow does not arrive to people who are likely to translate it as “hurt everyone who disagrees”.
I kind of doubt that leaders at big labs would self-identify as being motivated by anything like Eliezer’s notion of heroic responsibility. If any do self-identify that way though, they’re either doing it wrong or misunderstanding. Eliezer has written tons of stuff about the need to respect deontology and also think about all of the actual consequences of your actions, even (especially) when the stakes are high:
> The critical question here is: what happens if the plot successfully places the two of them in an epistemic Cooperation-Defection Dilemma, where rather than the two of them just having different goals, Carissa believes that he is mistaken about what happens...
> In this case, Carissa could end up believing that to play ‘Defect’ against him would be to serve even his own goals, better than her Cooperating would serve them. Betraying him might seem like a friendly act, an act of aid.
> If he commits to a drastic action he will estimate that actual victory lies at the end of it, and his desperation and sacrifice will not have figured into that estimation process as positive factors. His deontology is not for sale at the price point of failure.
Starting an AI lab in order to join a doomed race to superintelligence, and then engaging in a bunch of mundane squabbles for corporate control, seems like exactly the opposite of the sentiment here:
> For Albus Dumbledore, as for her, the rule in extremis was to decide what was the right thing to do, and do it no matter the cost to yourself. Even if it meant breaking your bounds, or changing your role, or letting go of your picture of yourself. That was the last resort of Gryffindor.
> It also seemingly justifies or obligates Sam Altman to fight back when the OpenAI board tried to fire him, if he believed the board was interfering with his mission.
In general, it seems perfectly fine and normal for a founder-CEO to fight back against a board ouster—no need to bring heroic responsibility into it. Of course, all parties including the CEO and the board should stick to legal / above-board / ethical means of “fighting back”, but if there’s a genuine disagreement between the board and the CEO on how to best serve shareholder interests (or humanity’s interests, for a non-profit), why wouldn’t both sides vigorously defend their own positions and power?
Perhaps the intended reading of your example is that heroic responsibility would obligate or justify underhanded tactics to win control, when the dispute has existential consequences. But I think that’s a misunderstanding of the actual concept. Ordinary self-confidence and agency obligate you to defend your own interests / beliefs / power, and heroic responsibility says that you’re obligated to win without stepping outside the bounds of deontology or slipping into invalid / motivated reasoning.
This argument seems convincing only if you don’t have those destructive values. One man’s destructive values are another’s low-hanging fruit, and those who see low-hanging fruit everywhere won’t give up on the fruit just because others may pick it.
Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either.
An analogy: one can make the same argument wrt rationality itself. It’s dual-use! Someone with bad values can use rationality to do a lot of harm! Does that mean good people shouldn’t use rationality? No!
> Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either.
I’m not trying to “warn bad people”. I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which “heroic responsibility” actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.
Ah yes, but if all these wannabe heroes keep going we’ll be really screwed, so it’s up to me to take a stand against the fools dooming us all… the ratchet of Moloch cranks ever clockwise
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge/familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
> While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ in the book text and then asking the model if there is anything in the book that seems out of place in the book. And the model failed to locate it.
> The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
> So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?
RLHF with humans might also soon get obsoleted by things like online DPO where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even weak-to-strong setting won’t necessarily be important, if tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (“constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what’s going on as a result of choosing particular instructions.
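For readers unfamiliar with DPO, here is a minimal sketch of the published DPO objective (Rafailov et al., 2023) that the comment above refers to; the function name, tensor shapes, and beta value are illustrative assumptions, and this is not taken from any lab’s actual training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-sequence log-probabilities
    (log pi(y|x) summed over response tokens) for the preferred ("chosen")
    and dispreferred ("rejected") responses, under the tuned policy and a
    frozen reference model. No separate reward model is trained.
    """
    # Log-ratio of tuned policy to reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred response's log-ratio above the dispreferred one's.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

In the online variant described above, the (chosen, rejected) pairs would come from another model (or the same model acting as judge) ranking freshly sampled on-policy responses, rather than from a fixed human-labeled dataset.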
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
> since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned), such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time of an experienced software engineer, which might suffice for good long context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
From a previous comment: [...] This seems to be evidence that RLHF does not tend to generalize well out-of-distribution
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high-quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high-quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days, which isn’t too bad of a delay. (If we’re fine with these labels trickling in for online training, the delay could be even smaller.)
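A minimal sketch of the two-head scheme described in the parenthetical above (fit one head on plentiful weak labels, then fit a second head on the scarce high-quality labels to predict the weak-to-strong correction). The class name, dimensions, and training recipe in the comments are my own illustrative assumptions, not a description of any existing codebase:

```python
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    """Reward model with a weak-label head plus a correction head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                         # e.g. an LM encoder returning [batch, hidden_dim]
        self.weak_head = nn.Linear(hidden_dim, 1)        # fit on ~100,000s of weak labels
        self.correction_head = nn.Linear(hidden_dim, 1)  # fit on ~1,000 high-quality labels

    def forward(self, inputs):
        features = self.backbone(inputs)
        weak_reward = self.weak_head(features)
        correction = self.correction_head(features)
        return weak_reward + correction                  # final reward estimate

# Sketch of the two-stage training:
#   1. Train weak_head on the weak labels (correction_head held at zero).
#   2. Freeze the backbone and weak_head; train correction_head to regress on
#      (high_quality_label - weak_head_prediction) over the small strong-label set.
```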
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
> Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”?
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
> Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
> Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
I’m thinking that the most ethical (morally least risky) way to “insure” against a scenario in which AI takes off and property/wealth still matters is to buy long-dated, far out-of-the-money S&P 500 calls. (The longest-dated and farthest out-of-the-money options seem to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
My reasoning here is that I don’t want to provide capital to AI industries or suppliers, because that seems wrong given the high x-risk I judge their activities to be causing (otherwise I’d directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my/moral values. Suggestions welcome for better/alternative ways to do this.
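To make the leverage arithmetic above explicit, here is a minimal sketch; it takes the quoted premium and payout at face value (I haven’t verified current option quotes) and ignores contract multipliers, taxes, and counterparty details:

```python
def call_payoff_per_point(spot_at_expiry: float, strike: float) -> float:
    """Intrinsic value of a call option at expiry, per index point."""
    return max(spot_at_expiry - strike, 0.0)

# Figures quoted in the comment above, taken as given:
premium = 78.0                 # quoted cost today of one Dec 2029 10000-strike SPX call
payout_if_spx_20000 = 10_000.0 # quoted payout if SPX reaches 20000 by expiry

print(call_payoff_per_point(20_000, 10_000))  # 10000.0 index points of intrinsic value
print(payout_if_spx_20000 / premium)          # ~128x implied return multiple on the premium
```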
This probably does help capitalize AI companies a little bit, since demand for call options will create some demand for the underlying. This is probably a relatively small effect (?), but I’m not confident in my ability to estimate this at all.
Default seems unlikely, unless the market moves very quickly, since anyone pursuing this strategy is likely to be very small compared to the market for the S&P 500.
(Also consider that these pay out in a scenario where the world gets much richer — in contrast to e.g. Michael Burry’s “Big Short” swaps, which paid out in a scenario where the market was way down — so you’re just skimming a little off the huge profits that others are making, rather than trying to get them to pay you at the same time they’re realizing other losses.)
It seems hard to buy AI companies at the moment. The only way is to buy tech giants like Microsoft, Google, and Nvidia, which are already valued very highly—it seems like it’s somewhat priced in. It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
It would seem one would want to buy Nasdaq rather than SPX? On the other hand, maybe most tech companies will be wiped out by AI—it’s the world of atoms that would gain relative value.
> It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
I’m also uncertain, and it’s another reason for going long a broad index instead. I would go even broader than the S&P 500 if I could, but nothing else has option chains going out to 2029.
I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia (including the most influential ones so far: UDT, UDASSA, b-money). Not really complaining, as they’re already more popular than I had expected (Holden Karnofsky talked extensively about UDASSA on an 80,000 Hours podcast, which surprised me); it just seems strange that the popularity stops right at academia’s door. (I think almost no philosophy professor, including ones connected with rationalists/EA, has talked positively about any of my philosophical ideas? And b-money languished for a decade gathering just a single citation in academic literature, until Satoshi reinvented the idea, but outside academia!)
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it. Hilary Greaves says the following in this 80k episode:
Then as many of your listeners will know, in the space of AI research, people have been throwing around terms like ‘functional decision theory’ and ‘timeless decision theory’ and ‘updateless decision theory’. I think it’s a lot less clear exactly what these putative alternatives are supposed to be. The literature on those kinds of decision theories hasn’t been written up with the level of precision and rigor that characterizes the discussion of causal and evidential decision theory. So it’s a little bit unclear, at least to my likes, whether there’s genuinely a competitor to decision theory on the table there, or just some intriguing ideas that might one day in the future lead to a rigorous alternative.
I also think it is unclear to what extent UDT and updatelessness are different from existing ideas in academia that are prima facie similar, like McClennen's (1990) resolute choice and Meacham's (2010, §4.2) cohesive decision theory.[1] Resolute choice in particular has been discussed in a lot of detail, and for a long time (see the citations of McClennen's book). (And, FWIW, my sense is that most philosophers think that resolute choice is irrational and/or doesn't make sense, at least if it is cashed out as a decision rule based on preference.)
It also doesn’t help that it is unclear what the difference between FDT and UDT is supposed to be.
(If UDT is supposed to be an LDT of some sort, then you might want to check out Spohn’s (2012)[2] version of CDT, Fisher’s (n.d) disposition-based decision theory, and Poellinger’s (2013) discussion of Spohn’s theory, for ideas in academia that are similar to the LDT-part of the theory. And then there is also Schwarz’ critique of FDT, which would then also apply to UDT, at least partially.)
My own take, using the terminology listed here, is that the causalist version of Meacham's cohesive decision theory is basically "updateless CDT", that the evidentialist version is basically "updateless EDT", and that a Spohn-CDT version of cohesive decision theory is basically "U(C)DT/F(C)DT". I also think that resolute choice is much more permissive than e.g. cohesive decision theory and updatelessness. As a decision rule, it doesn't recommend anything close to "maximizing EU relative to your prior". Instead, it just states that (i) what you prefer ex ante in a dynamic choice problem should be the same as what you prefer in the normalised version of the problem, and (ii) you should be dynamically consistent (i.e., the most preferred plan should not change throughout the decision problem).
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues, despite them having existed since the early 90s[1] and appearing on the introductory SEP page for dynamic choice.
Academics may in turn ask: clearly LessWrong has some blind spots, but how big?
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues
I don’t think this is fair, because even though component ideas behind UDT/FDT have academic analogues, it doesn’t look like someone put them together into a single decision theory formulation in academic literature, at least prior to MIRI’s “Cheating Death in Damascus” being published. Also “Cheating Death in Damascus” does cite both Meacham and Spohn (and others) and it seems excusable for me to have forgotten those references since they were both published after I wrote about UDT and again were only component ideas of it, plus I haven’t actively worked on decision theory for several years.
I think Sami's comment is entirely fair given the language and framing of the original post. It is of course fine to forget about references, but e.g. "I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia" and "Clearly academia has some blind spots, but how big?" read like you don't consider it a possibility that you might have re-invented something yourself, and that academics are at fault for not taking up your ideas.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT has not been invented or re-invented or have much of a following in academia, at least as of 2019 when Cheating Death in Damascus was published, because the authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
I think there is nothing surprising about a small community of nerds writing in their spare time having blind spots, but when a large professional community has such blind spots, that is surprising.
On your first point: as Sami writes, resolute choice is mentioned in the introductory SEP article on dynamic choice (it even has its own section!), as well as in the SEP article on decision theory. And SEP is the first place you go when you want to learn about philosophical topics and find references.
On your second point: as I wrote in my comment above, (i) academics have produced seemingly similar ideas to e.g. updatelessness (well before they were written up on LW) so it is unclear why academics should engage with less rigorous, unpublished proposals that appear to be similar (in other words, I don’t think the phrase “blind spots” is warranted), and (ii) when academics have commented on or engaged with LW DT ideas, they have to my knowledge largely been critical (e.g. see the post by Wolfgang Schwarz I linked above, as well as the quote from Greaves)[1].
To clarify, by “blind spot” I wasn’t complaining that academia isn’t engaging specifically with posts written up on LW, but more that nobody in academia seems to think that the combination of “updateless+logical” is clearly the most important or promising direction to explore in decision theory.
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it.
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
And the reason for the latter is the above, plus my personal distaste for writing/publishing academic papers (which I talked about elsewhere in this thread), plus FDT having been published which seems close enough to me.
Thanks for the references in the rest of your comment. I think I've come across Meacham 2010 and Spohn 2012 before, but forgot about them as I haven't been working actively on decision theory for a while. It does seem that Meacham's cohesive decision theory is equivalent to updateless EDT/CDT. (BTW in The Absent-Minded Driver I referenced a 1997 paper that also has an idea similar to updatelessness, although the authors didn't like it.)
On a quick skim of Spohn 2012 I didn’t see something that looks like LDT or “algorithmic/logical agent ontology” but it’s quite long/dense so I’ll take your word on it for now. Still, it seems like none of the academic papers put all of the pieces together in a single decision theory proposal that’s equivalent to UDT or FDT?
(Please note that UDT as originally described was actually updateless/evidential/logical, not causalist as you wrote in the post that you linked. This has been a historical disagreement between me and Eliezer, wherein I leaned towards evidential and he leaned towards causal, although these days I just say that I'm confused and don't know what to think.)
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
That’s fair. But what is it then that you expect academics to engage with? How would you describe this research direction, and why do you think it’s interesting and/or important?
To quickly recap the history, people on LW noticed some clear issues with “updating” and “physicalist ontology” of the most popular decision theories at the time (CDT/EDT), and thought that switching to “updatelessness” and “logical/algorithmic ontology” was an obvious improvement. (I was the first person to put the two pieces together in an explicit formulation, but they were already being talked about / hinted at in the community.) Initially people were really excited because the resulting decision theories (UDT/FDT) seemed to solve a lot of open problems in one swoop, but then pretty quickly and over time we noticed more and more problems with UDT/FDT that seem to have no clear fixes.
So we were initially excited but then increasingly puzzled/confused, and I guess I was expecting at least some academics to follow a similar path, either through engagement with LW ideas (why should they be bothered that much by lack of academic publication?), or from independent invention. Instead academia seems to still be in a state similar to LW when I posted UDT, i.e., the ideas are floating in the air separately and nobody has put them together yet? (Or I guess that was the state of academia before FDT was published in an academic journal, so now the situation is more like some outsiders put the pieces together in a formal publication, but still no academic is following a similar path as us.)
I guess it’s also possible that academia sort of foresaw or knew all the problems that we’d eventually find with UDT/FDT and that’s why they didn’t get excited in the first place. I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are? Or what are they mostly working on / arguing about these days?
There are many many interesting questions in decision theory, and "dimensions" along which decision theories can vary, not just the three usually discussed on LessWrong. It's not clear to me (i) why philosophers should focus on the dimensions you primarily seem to be interested in, or (ii) what is so special about the particular combination you mention (is there some interesting interaction I don't know about maybe?). Furthermore, note that most philosophers probably do not share your intuitions: I'm pretty sure most of them would e.g. not pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.) I don't mean to be snarky, but you could just be wrong about what the open problems are.
I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are?
I wouldn’t say so, no. But I’m not entirely sure if I understand what the open problems are. Reading your list of seven issues, I either (i) don’t understand what you are asking, (ii) disagree with the framing/think the question is misguided, or (iii) think there is an obvious answer (which makes me think that I’m missing something). With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
There are many many interesting questions in decision theory, and “dimensions” along which decision theories can vary, not just the three usually discussed on LessWrong.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
what is so special about the particular combination you mention
Furthermore, note that most philosophers probably do not share your intuitions
Agreed, but my intuitions don’t seem so unpopular outside academia or so obviously wrong that there should be so few academic philosophers who do share them.
I'm pretty sure most of them would e.g. not pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.)
I'm not sure I would pay either. I see it as more of an interesting puzzle than having a definitive answer. ETA: Although I'm more certain that we should build AIs that do pay. Is that also unclear to you? (If so why might we not want to build such AIs?)
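(For concreteness, with the standard illustrative stakes of a fair coin, $100 demanded on the losing flip, and $10,000 counterfactually rewarded on the winning one, the prior expected value of the policy "pay" is

$$\mathbb{E}[\text{pay}] = \tfrac{1}{2}(+\$10{,}000) + \tfrac{1}{2}(-\$100) = +\$4{,}950 > \$0 = \mathbb{E}[\text{refuse}],$$

which is why updateless-style theories pay, whereas an agent who first updates on having seen the losing flip sees only the certain loss of $100 and refuses. These are just the payoffs conventionally used to present the problem, not anything load-bearing.)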
I don’t mean to be snarky, but you could just be wrong about what the open problems are.
Yeah, I’m trying to keep an open mind about that. :)
With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
Cool, I’d be interested in any further feedback when you’re ready to give them.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
I'm not sure I would pay either. I see it as more of an interesting puzzle than having a definitive answer. ETA: Although I'm more certain that we should build AIs that do pay. Is that also unclear to you? (If so why might we not want to build such AIs?)
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
On the point about AI (not directly responding to your question, to which I don’t have an answer): I think it’s really important to be clear about whether we are discussing normative, constructive or descriptive decision theory (using Elliott Thornley’s distinction here). For example, the answers to “is updatelessness normatively compelling?”, “should we build an updateless AI?” and “will some agents (e.g. advanced AIs) commit to being updateless?” will most likely come apart (it seems to me). And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
Yes, missed or forgot about that sentence, sorry.
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
Thanks.
Do you have more examples where making such distinctions would be helpful?
I was mostly thinking about discussions surrounding what the "correct" decision theory is, whether you should pay in CM, and so on.
Thanks Sylvester! Yep it looks like cohesive decision theory is basically original UDT. Do you know what the state of the art is in terms of philosophical critiques of cohesive decision theory? Any good ones? Any good responses to the critiques?
Cohesive decision theory lacks the logical/algorithmic ontology of UDT and is closer to what we call “updateless EDT/CDT” (the paper itself talks about cohesive versions of both).
Also interested in a response from Sylvester, but I would guess that one of the main critiques is something like Will MacAskill’s Bomb thought experiment, or just intuitions for paying the counterfactual mugger. From my perspective, these do have a point when it comes to humans, since humans seemingly have indexical values, and one way to explain why UDT makes recommendations in these thought experiments that look “bizarre” to many humans is that it assumes away indexical values (via the type signature of its utility function). (It was an implicit and not totally intentional assumption, but it’s unclear how to remove the assumption while retaining nice properties associated with updatelessness.) I’m unsure if indexical values themselves are normative or philosophically justified, and they are probably irrelevant or undesirable when it comes to AIs, but I guess academic philosophers probably take them more for granted and are not as interested in AI (and therefore take a dimmer view on updatelessness/cohesiveness).
But yeah, if there are good critiques/responses aside from these, it would be interesting to learn them.
I don’t think cohesive decision theory is being discussed much, but I’m not sure. Perhaps because the theory is mainly used to argue against the claim that “every decision rule will lead agents who can’t bind themselves to disaster” (p. 20, footnote 34) in the paper, and discussion of its independent interest is relegated to a footnote (footnote 34).
Something about the aesthetics of what makes a “good” peer-reviewed physics paper, based on my personal experience, probably including my anecdote about solar cell R&D from here
I followed this link thinking that it looks relevant to my question, but the way Holden delineates what academia is interested in, it should totally be interested in my ideas:
I, today, when I think about what academia does, I think it is really set up to push the frontier of knowledge, the vast majority, and I think especially in the harder sciences. I would say the vast majority of what is going on in academia is people are trying to do something novel, interesting, clever, creative, different, new, provocative, that really pushes the boundaries of knowledge forward in a new way.
versus what Holden says are important questions that academia neglects:
There’s an intellectual topic, it’s really important to the world but it’s not advancing the frontier of knowledge. It’s more figuring out something in a pragmatic way that is going to inform what decision makers should do, and also there’s no one decision maker asking for it as would be the case with Government or corporations.
The rest of your comment seems to be hinting that maybe academia is ignoring my ideas because it doesn’t like the aesthetics of my writing? (Not sure if that was your point, or if those bullet points weren’t supposed to be directly related to my question...) Even if that’s true though, I’m still puzzled why academia hasn’t reinvented any of my ideas (which have been independently invented multiple times outside of academia, e.g. Nick Szabo and Satoshi with b-money, Paul Christiano with UDASSA).
Hmm, yeah I guess what I wrote wasn’t too directly helpful for your question.
the way Holden delineates what academia is interested in, it should totally be interested in my ideas…
I think Holden forgot “trendy”. Trendy is very important. I think people in academia have a tacit shared understanding of the currently-trending topics / questions, within which there’s a contest to find interesting new ideas / progress. If an idea is important but not trendy, it’s liable to get neglected, I think. It’s kinda like in clothing fashion: if you find a brilliant use of beads, but beads aren’t fashion-forward this year, roughly nobody will care.
Of course, the trends change, and indeed everyone is trying to be the pioneer of the next hot topic. There are a lot of factors that go into “what is the next hot topic”, including catching the interest of a critical mass of respected people (or people-who-control-funding), which in turn involves them feeling it’s “exciting”, and that they themselves have an angle for making further progress in this area, etc. But trendiness doesn’t systematically track objective importance, and it’s nobody’s job to make it so.
At least, that’s what things felt like to me in the areas of physics I worked in (optics, materials science, and related). I’m much less familiar with philosophy, economics, etc.
Remember, aside from commercially-relevant ideas, success for academic research scientists (and philosophers) is 100% determined by "am I impressing my peers?"—grants, promotions, invited talks, etc. are all determined by that. So if I write a paper and the prestigious people in my field are unanimously saying "I don't know about that thing, it's not an area that I know or care about", the result is just as bad for me and my career as if those people had unanimously said "this is lousy work".
it doesn’t like the aesthetics of my writing
To be clear, when I said “the aesthetic of what makes a good X”, I meant it in a really broad sense. Maybe I should have said “the implicit criteria of what makes a good X” instead. So “the paper concerns a currently-trendy topic” can be part of that, even though it’s not really “aesthetics” in the sense of beauty. E.g., “the aesthetic of what makes a good peer-reviewed experimental condensed-matter physics paper” has sometimes been greatly helped by “it somehow involves nanotechnology”.
From my years in academia studying neuroscience and related aspects of bioengineering and medicine development… yeah. So much about how effort gets allocated is not 'what would be good for our country's population in expectation, or good for all humanity'. It's mostly about 'what would make an impressive-sounding research paper that could get into an esteemed journal?', 'what would be relatively cheap and easy to do, but sound disproportionately cool?', 'what do we guess that the granting agency we are applying to will like the sound of?'. So much emphasis on catching waves of trendiness, and so little on estimating the expected value of the results.
Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.
Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.
I think that UDASSA and UDT might be in academia’s blind spots in the same way that the Everett interpretation is: more correct theories that came after less correct theories with mostly only theoretical evidence to support changing over to the new theories.
Aside from the literature on international relations, I don't know much about academic dysfunction (what I do know comes mostly from reading parts of Inadequate Equilibria, particularly the visitor dialog), and other LessWrong people can probably cover it better. I think that planecrash, Yud's second HPMOR-scale work, mentions that everyone in academia just generally avoids citing things published outside of academia, because they risk losing status if they do.
It turns out that Earth economists are locked into powerful incentive structures of status and shame, which prevent them from discussing the economic work of anybody who doesn’t get their paper into a journal. The journals are locked into very powerful incentive structures that prevent them from accepting papers unless they’re written in a very weird Earth way that Thellim can’t manage to imitate, and also, Thellim hasn’t gotten tenure at a prestigious university which means they’ll probably reject the paper anyways. Thellim asks if she can just rent temporary tenure and buy somebody else’s work to write the paper, and gets approximately the same reaction as if she asked for roasted children recipes.
The system expects knowledge to be contributed to it only by people who have undergone painful trials to prove themselves worthy. If you haven’t proven yourself worthy in that way, the system doesn’t want your knowledge even for free, because, if the system acknowledged your contribution, it cannot manage not to give you status, even if you offer to sign a form relinquishing it, and it would be bad and unfair for anyone to get that status without undergoing the pains and trials that others had to pay to get it.
She went and talked about logical decision theory online before she’d realized the full depth of this problem, and now nobody else can benefit from writing it up, because it would be her idea and she would get the status for it and she’s not allowed to have that status. Furthermore, nobody else would put in the huge effort to push forward the idea if she’ll capture their pay in status. It does have to be a huge effort; the system is set up to provide resistance to ideas, and disincentivize people who quietly agreed with those ideas from advocating them, until that resistance is overcome. This ensures that pushing any major idea takes a huge effort that the idea-owner has to put in themselves, so that nobody will be rewarded with status unless they have dedicated several years to pushing an idea through a required initial ordeal before anyone with existing status is allowed to help, thereby proving themselves admirable enough and dedicated enough to have as much status as would come from contributing a major idea.
To suggest that the system should work in any different way is an obvious plot to steal status that is only deserved by virtuous people who work hard, play by the proper rules, and don’t try to cheat by doing anything with less effort than it’s supposed to take.
It's glowfic, so of course I don't know how accurate it is, as it's intended to be plausibly deniable enough to facilitate free writing (while keeping things entertaining enough to register as not-being-work).
I have to think more about the status dynamics that Eliezer talked about. There’s probably something to it… But this part stands out as wrong or at least needing nuance/explanation:
Thellim hasn’t gotten tenure at a prestigious university which means they’ll probably reject the paper anyways
I think most academic venues do blind reviews and whoever decides whether or not to accept a paper isn’t supposed to know who wrote it? Which isn’t to say that the info won’t leak out anyway and influence the decision. (For example I once left out the acknowledgements section in a paper submission, thinking that, like the author byline, I was supposed to add it after the paper was accepted, but apparently I was actually supposed to include it and someone got really peeved that I didn’t.)
MIRI suggested I point out that Cheating Death In Damascus had recently been accepted in The Journal of Philosophy, a top philosophy journal, as evidence of (hopefully!) mainstream philosophical engagement.
From talking with people who do work on a lot of grant committees in the NIH and similar funding orgs, it's really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea… So if you are familiar with the general approach of a particular lab and its principal investigator, you will immediately recognize and have a knee-jerk reaction (positive or negative) to a paper which pattern-matches to the work that that lab / subfield is doing.
Common reactions from grant reviewers:
Positive—“This fits in nicely with my friend Bob’s work. I respect his work, I should argue for funding this grant.”
Neutral—“This seems entirely novel to me, I don’t recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn’t argue too hard for it.”
Slightly negative—“This seems novel to me, and doesn’t sound particularly ‘jargon-y’ or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this.”
Negative—“This seems to pattern match to a subfield I feel biased against. Even if this isn’t from one of Jill’s students, it fits with Jill’s take on this subtopic. I don’t want views like Jill’s gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal.”
I will self-downvote so this isn’t the top comment. Yud’s stuff is neat, but I haven’t read much on the topic, and passing some along when it comes up has been a good general heuristic.
Many parts of academia have a strong Not Invented Here tendency. It's not just that research outside of academia is usually ignored; even research outside a specific academic citation bubble gets ignored, even if another bubble investigates a pretty similar issue. For example, economic decision theorists ignore philosophical decision theorists, who in turn mostly ignore the economic decision theorists. They each have their own writing style and concerns and canonical examples or texts. Which makes it hard for outsiders to read the literature or even contribute to it, so they don't.
A striking example is statistics, where various fields talk about the same mathematical thing with their own idiosyncratic names, unaware or unconcerned whether it already had a different name elsewhere.
Edit: Though LessWrong is also a citation bubble to some degree.
Something a better, future version of rationalism could do is build bridges and facilitate communication between these little bubbles. The answer-to-everything approach has been tried too many times.
Indeed, there is no need for sorrow, for by choosing to remain anonymous, you have done great things. The world owes you a Nobel Prize in Economics and a Turing Award. It is time for the world to seriously recognize your achievements and lead it towards a financial system without bubbles.
Why haven't you written academic articles on these topics?
The secret is that an academic article is just a format, and anyone can submit to scientific journals. There's no need to have a PhD or even to work at a scientific institution.
I wrote an academic-style paper once, as part of my job as an intern in a corporate research department. It soured me on the whole endeavor, as I really didn’t enjoy the process (writing in the academic style, the submission process, someone insisting that I retract the submission to give them more credit despite my promise to insert the credit before publication), and then it was rejected with two anonymous comments indicating that both reviewers seemed to have totally failed to understand the paper and giving me no chance to try to communicate with them to understand what caused the difficulty. The cherry on top was my mentor/boss indicating that this is totally normal, and I was supposed to just ignore the comments and keep resubmitting the paper to other venues until I run out of venues.
My internship ended around that point and I decided to just post my ideas to mailing lists / discussion forums / my home page in the future.
Also, I think MIRI got FDT published in some academic philosophy journal, and AFAIK nothing came of it?
There is some similarity between UDASSA and "Law without Law" by Mueller, as both use Kolmogorov complexity to predict the distribution of observers. In LwL there isn't any underlying reality except numbers, so it is just dust theory over random number fields.
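(For reference, both weight a description roughly by its algorithmic probability under the universal prior, something like

$$m(x) = \sum_{p \,:\, U(p) = x} 2^{-\ell(p)},$$

where $U$ is a universal prefix machine and $\ell(p)$ is the length of program $p$; in UDASSA the $x$ being weighted is, roughly, a specification of an observer-moment, so observers that are simpler to specify get more measure.)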
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
From someone who left a corner of it: the blindspots could be arbitrarily large as far as I know, because there seemed to me to be no real explicit culture of Hamming questions/metalooking for anything neglected. You worked on something vaguely similar/related to your advisor’s work, because otherwise you can’t get connections to people who know how to attack the problem.
Maybe Chinese civilization was (unintentionally) on the right path: discourage or at least don’t encourage technological innovation but don’t stop it completely, run a de facto eugenics program (Keju, or Imperial Examination System) to slowly improve human intelligence, and centralize control over governance and culture to prevent drift from these policies. If the West hadn’t jumped the gun with its Industrial Revolution, by the time China got to AI, human intelligence would be a lot higher and we might be in a much better position to solve alignment.
This was inspired by @dsj’s complaint about centralization, using the example of it being impossible for a centralized power or authority to deal with the Industrial Revolution in a positive way. The contrarian in my mind piped up with “Maybe the problem isn’t with centralization, but with the Industrial Revolution!” If the world had more centralization, such that the Industrial Revolution never started in an uncontrolled way, perhaps it would have been better off in the long run.
One unknown is what would the trajectory of philosophical progress look like in this centralized world, compared to a more decentralized world like ours. The West seems to have better philosophy than China, but it’s not universal (e.g. analytical vs Continental philosophy). (Actually “not universal” is a big understatement given how little attention most people pay to good philosophy, aside from a few exceptional bubbles like LW.) Presumably in the centralized world there is a strong incentive to stifle philosophical progress (similar to China historically), for the sake of stability, but what happens when average human IQ reaches 150 or 200?
What motive does a centralized dominant power have to allow any progress?
A culture/ideology that says the ruler is supposed to be benevolent and try to improve their subjects’ lives, which of course was not literally followed, but would make it hard to fully suppress things that could clearly make people’s lives better, like many kinds of technological progress. And historically, AFAIK few if any of the Chinese emperors tried to directly suppress technological innovation, they just didn’t encourage it like the West did, through things like patent laws and scientific institutions.
The entire world would likely look more like North Korea.
Yes, directionally it would look more like North Korea, but I think the controls would not have to be as total or harsh, because there is less of a threat that outside ideas could rush in and overturn the existing culture/ideology the moment you let your guard down.
I don't think patent laws are the key difference; other property rights are more important.
Gutenberg was living in a free city and was able to take a loan to start a startup with 15-25 employees on the promise of an unproven technology, without any patent laws protecting him. Merchants were strong enough at the time that the idea of creating a startup with the main motivation of financial return was viable.
Chinese rulers could just take whatever they wanted from merchants, so merchants were less powerful and did not make similar capital investments. When Europeans sailed across the oceans, they did it because of the ability to make a profit from trade. When the Chinese did it, they wanted to bring home gifts to the emperor.
The key difference is that European merchants could make very expensive capital investments that then produced returns and reinvest those returns to produce even more value.
Which period of “chinese civilisation” are you referring to? I think it would be hard to point to any isolated “chinese civilisation” just minding its own business and keeping a firm grip on a unified cultural and ethnic population. Over 3500+ years of written history the territory occupied by China today had multiple periods of unity and division, sometimes splitting up into 10 or more states, often with multiple empires and dynasties coexisting in various levels of war and peace and very loosely ruled areas in between. (This is IMO a central theme of Chinese history: the first line of the Romance of the Three Kingdoms reads “Of matters under heaven, we can say that what is long united must divide, what is long divided must unite”. At various points the “Chinese Empire” looked more like the Holy Roman Empire, e.g. during the late Zhou dynasty leading into the Spring and Autumn period)
The “chinese lands” were taken over by the Mongols and the Manchu during the Yuan and Qing dynasties (the latter one being the last dynasty before the 20th century), and at various points the borders of the Chinese empire would grow and shrink to encompass what we today recognise as Korea, Japan, South East Asia, Tibet… There are 56 recognised ethnic groups in China today. The importance and purpose of the Keju system also changed throughout the periods it was in use, and I have no idea where you got the eugenics thing from. I also think you would have a hard time building a case for any intentional or centralised control of scientific research beyond that of the European states at the time, mostly because the idea of scientific research is itself a very modern one (is alchemical research science?). As far as I can understand it you’re taking the “vibe” of a strong, unified, centralised state that people recognise today in the People’s Republic of China and then stretching it backwards to create some kind of artificial historical throughline.
(The following is written by AI (Gemini 2.5 Pro) but I think it correctly captured my position.)
You’re right to point out that I’m using a highly stylized and simplified model of “Chinese civilization.” The reality, with its dynastic cycles, periods of division, and foreign rule, was far messier and more brutal than my short comment could convey.
My point, however, isn’t about a specific, unbroken political entity. It’s about a civilizational attractor state. The remarkable thing about the system described in “Romance of the Three Kingdoms” is not that it fell apart, but that it repeatedly put itself back together into a centralized, bureaucratic, agrarian empire, whereas post-Roman Europe fragmented permanently. Even foreign conquerors like the Manchus were largely assimilated by this system, adopting its institutions and governing philosophy (the “sinicization” thesis).
Regarding the Keju, the argument isn’t for intentional eugenics, but a de facto one. The mechanism is simple: if (1) success in the exams correlates with heritable intelligence, and (2) success confers immense wealth and reproductive opportunity (e.g., supporting multiple wives and children who survive to adulthood), then over a millennium you have created a powerful, systematic selective pressure for those traits.
The core of the thought experiment remains: is a civilization that structurally, even if unintentionally, prioritizes stability and slow biological enhancement over rapid, disruptive technological innovation better positioned to handle long-term existential risks?
It's based on the idea that Keju created a long-term selective pressure for intelligence (a rough back-of-the-envelope version is sketched after this list):
The exams selected for heritable cognitive traits.
Success led to positions in the imperial government, and therefore power and wealth.
Power and wealth allowed for more wives, concubines, food, resources, and many more surviving children than the average person, which was something many Chinese consciously aimed for. (Note that this is very different from today’s China or the West, where cultural drift/evolution has much reduced or completely eliminated people’s desires to translate wealth into more offspring.)
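A back-of-the-envelope version of this mechanism, using the breeder's equation with purely illustrative numbers (the pass rate, passers' trait level, fitness advantage, and heritability below are all assumptions, not historical estimates):

```python
# Toy quantitative-genetics sketch of a Keju-like selection pressure.
# All parameter values are illustrative assumptions, not historical estimates.

p = 0.005        # fraction of each generation that passes the exams
z_pass = 2.0     # mean trait value of passers, in SD above the overall mean
w = 2.0          # surviving children of passers relative to everyone else
h2 = 0.5         # assumed narrow-sense heritability of the trait

# Non-passers' mean, chosen so the unweighted population mean is 0.
z_fail = -p * z_pass / (1 - p)

mean_fitness = p * w + (1 - p)
# Selection differential S: fitness-weighted parental mean minus the population mean.
S = (p * w * z_pass + (1 - p) * z_fail) / mean_fitness
R = h2 * S  # breeder's equation: per-generation response to selection

generations = 40  # roughly a millennium at ~25 years per generation
print(f"shift per generation: {R:.4f} SD")
print(f"shift over {generations} generations: {R * generations:.2f} SD")
```

Under these made-up numbers the shift is on the order of 0.005 SD per generation, or roughly 0.2 SD over a millennium: a real but slow effect, consistent with the "slowly improve human intelligence" framing above.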
If people started trying earnestly to convert wealth/income into more kids, we'd come under Malthusian constraints again, and before that, much backsliding in living standards and downward social mobility for most people, which would trigger a lot of cultural upheaval and potential backlash (e.g., calls for more welfare/redistribution and attempts to turn culture back against "eugenics"/"social Darwinism", which will probably succeed just like they succeeded before). It seems ethically pretty fraught to try to push the world in that direction, to say the least, and it has a lot of other downsides, so I think at this point a much better plan to increase human intelligence is to make available genetic enhancements that parents can voluntarily choose for their kids, government-subsidized if necessary to make them affordable for everyone, which avoids most of these problems.
Presumably in the centralized world there is a strong incentive to stifle philosophical progress (similar to China historically), for the sake of stability
But would they be at all threatened by analytic philosophy, brewing patiently in its corner as an amusement of sensible people?
A: Hey, I just learned about this idea of artificial superintelligence. With it, we can achieve incredible material abundance with no further human effort!
B: Thanks for telling me! After a long slog and incredible effort, I’m now a published AI researcher!
A: No wait! Don’t work on AI capabilities, that’s actually negative EV!
B: What?! Ok, fine, at huge personal cost, I’ve switched to AI safety.
This observation should make us notice confusion about whether AI safety recruiting pipelines are actually doing the right type of thing.
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
For example, an “AI capabilities” researcher from a decade ago who cared much more about fundamental knowledge than about citations might well have invented mechanistic interpretability without any thought of safety or alignment. Similarly, an AI capabilities researcher at OpenAI who was sufficiently high-integrity might have whistleblown on the non-disparagement agreements even if they didn’t have any “safety-aligned” motivations.
Also, AI safety researchers who have those traits won’t have an attitude of “What?! Ok, fine” or “WTF! Alright you win” towards people who convince them that they’re failing to achieve their goals, but rather an attitude more like “thanks for helping me”. (To be clear, I’m not encouraging people to directly try to adopt a “thanks for helping me” mentality, since that’s liable to create suppressed resentment, but it’s still a pointer to a kind of mentality that’s possible for people with sufficiently little internal conflict.) And in the ideal case, they will notice that there’s something broken about their process for choosing what to work on, and rethink that in a more fundamental way (which may well lead them to conclusions similar to mine above).
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
I’m not sure I actually agree with this. Can you explain how someone who is virtuous, but missing the crucial consideration of “legible vs. illegible AI safety problems” can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?
My current (uncertain) perspective is that we actually do still need people to be “acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)” but the AI safety community needs to get better at being strategic somehow. Otherwise I don’t see how each person can discover all of the necessary crucial considerations on their own, or even necessarily appreciate all the important considerations that the community has come up with. And I do not see why “people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.”
(Or alternatively put all/most effort into AI pause/stop/slowdown, which perhaps does not require as much strategic finesse.)
Can you explain how someone who is virtuous, but missing the crucial consideration of “legible vs. illegible AI safety problems” can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?
If a person is courageous enough to actually try to solve a problem (like AI safety), and high-integrity enough to avoid distorting their research due to social incentives (like incentives towards getting more citations), and honest enough to avoid self-deception about how to interpret their research, then I expect that they will tend towards doing “illegible” research even if they’re not explicitly aware of the legible/illegible distinction. One basic mechanism is that they start pursuing lines of thinking that don’t immediately make much sense to other people, and the more cutting-edge research they do the more their ontology will diverge from the mainstream ontology.
This has pretty low argumentative/persuasive force in my mind.
then I expect that they will tend towards doing “illegible” research even if they’re not explicitly aware of the legible/illegible distinction.
Why? I’m not seeing the logic of how your premises lead to this conclusion.
And even if there is this tendency, what if someone isn’t smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?
And even the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to him and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?
This has pretty low argumentative/persuasive force in my mind.
Note that my comment was not optimized for argumentative force about the overarching point. Rather, you asked how they “can” still benefit the world, so I was trying to give a central example.
In the second half of this comment I’ll give a couple more central examples of how virtues can allow people to avoid the traps you named. You shouldn’t consider these to be optimized for argumentative force either, because they’ll seem ad-hoc to you. However, they might still be useful as datapoints.
Figuring out how to describe the underlying phenomenon I’m pointing at in a compelling, non-ad-hoc way is one of my main research focuses. The best I can do right now is to say that many of the ways in which people produce outcomes which are harmful (by their own lights) seem to arise from a handful of underlying dynamics. I call this phenomenon pessimization. One way in which I’m currently thinking about virtues is as a set of cognitive tools for preventing pessimization. As one example, kindness and forgiveness help to prevent cycles of escalating conflict with others, which is a major mechanism by which people’s values get pessimized. This one is pretty obvious to most people; let me sketch out some less obvious mechanisms below.
what if someone isn’t smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?
This actually happened to me: when I graduated from my master's I wasn't cognitively capable of coming up with new lines of illegible alignment research, in part because I was too status-seeking. Instead I went to work at DeepMind, and ended up spending a lot of my time working on RLHF, which is a pretty central example of a "legible" line of research.
However, I also wasn’t cognitively capable of making much progress on RLHF, because I couldn’t see how it addressed the core alignment problem, and so it didn’t seem fundamental enough to maintain my interest. Instead I spent most of my time trying to understand the alignment problem philosophically (resulting in this sequence) at the expense of my promotion prospects.
In this case I think I had the virtue of deep curiosity, which steered my attention towards illegible problems even though my top-down plan was to contribute to alignment by doing RLHF research. These days, whatever you might think of my research, few people complain that it’s too legible.
There are other possible versions of me who had that deep curiosity but weren’t smart enough to have generated a research agenda like my current one; however, I think they would still have left DeepMind, or at least not been very productive on RLHF.
And even the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to him and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?
When a field becomes crowded, there’s a pretty obvious inference that you can make more progress by moving to a less crowded field. I think people often don’t draw that inference because moving to a less crowded field loses them prestige, is emotionally/financially risky, etc. Virtues help remove those blockers.
Sorry, you might be taking my dialog too seriously, unless you’ve made such observations yourself, which of course is quite possible since you used to work at OpenAI. I’m personally far from the places where such dialogs might be occurring, so don’t have any observations of them myself. It was completely imagined in my head, as a dark comedy about how counter to human (or most human’s) nature strategic thinking/action about AI safety is, and partly a bid for sympathy for the people caught in the whiplashes, to whom this kind of thinking or intuition doesn’t come naturally.
Edit: To clarify a bit more, B’s reactions like “WTF!” were written more for comedic effect, rather than trying to be realistic or based on my best understanding/predictions of how a typical AI researcher would actually react. It might still be capturing some truth, but again just want to make sure people aren’t taking my dialog more seriously than I intend.
I’m taking the dialogue seriously but not literally. I don’t think the actual phrases are anywhere near realistic. But the emotional tenor you capture of people doing safety-related work that they were told was very important, then feeling frustrated by arguments that it might actually be bad, seems pretty real. Mostly I think people in B’s position stop dialoguing with people in A’s position, though, because it’s hard for them to continue while B resents A (especially because A often resents B too).
Some examples that feel like B-A pairs to me include: people interested in “ML safety” vs people interested in agent foundations (especially back around 2018-2022); people who support Anthropic vs people who don’t; OpenPhil vs Habryka; and “mainstream” rationalists vs Vassar, Taylor, etc.
What is the “great personal cost” to shifting from AI capabilities to safety? Sure, quitting one’s frontier lab job to become an independent researcher means taking a pay cut, but that’s an opportunity cost and not really an enormous sacrifice. It’s not like any frontier labs would try and claw back your equity … again.
I’ve seen somewhere that (some) people at AI labs are thinking in terms of shares of the future lightcone, not just money.
If most of your friends are capabilities researchers who aren't convinced that their work is negative EV yet, it might be pretty awkward when they ask why you've switched to safety.
There’s a big prestige drop (in many people’s minds, such as one’s parents’) from being at a place like OpenAI (perceived by many as a group made up of the best of the best) to being an independent researcher. (“What kind of a job is that?!”)
Having to let go of sunk costs (knowledge/skills for capabilities research) and invest in a bunch of new human capital needed for safety research.
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT doesn’t know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)
Isn't the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the “constitution”. In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
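As a minimal sketch of what "RLHF on synthetic preference data" can look like, where sample_model and ai_judge are hypothetical stand-ins for calls to an actual LLM (stubbed out here so the example runs) and the constitution is reduced to a single instruction string:

```python
# Minimal sketch of RLAIF-style preference-data generation (illustrative only).
import random

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def sample_model(prompt: str) -> str:
    # Stand-in for sampling a candidate response from the policy model.
    return f"candidate response #{random.randint(0, 9)} to {prompt!r}"

def ai_judge(constitution: str, prompt: str, a: str, b: str) -> int:
    # Stand-in for asking an AI judge which response better follows the constitution.
    # Returns 0 if `a` is preferred, 1 if `b` is preferred.
    return random.randint(0, 1)

def build_preference_data(prompts, pairs_per_prompt=1):
    data = []
    for prompt in prompts:
        for _ in range(pairs_per_prompt):
            a, b = sample_model(prompt), sample_model(prompt)
            preferred = ai_judge(CONSTITUTION, prompt, a, b)
            chosen, rejected = (a, b) if preferred == 0 else (b, a)
            data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

if __name__ == "__main__":
    dataset = build_preference_data(["How do I sort a list in Python?"])
    print(dataset)
    # A reward model (or DPO) would then be trained on this dataset, replacing the
    # human comparisons used in standard RLHF.
```

The "constitution" is then the main remaining external input, as described above; everything downstream of it is automated.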
The Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And the data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it's unclear if future models won't soon do better than humans at labeling, or possibly already do better at some leading labs. Meta didn't have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models, and o1-like reasoning models.
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here’s a poem(?) I wrote for Twitter, actually before coming across the above story:
The people try to align the board. The board tries to align the CEO. The CEO tries to align the managers. The managers try to align the employees. The employees try to align the contractors. The contractors sneak the work off to the AI. The AI tries to align the AI.
Maybe others are using it in secret but don’t want to admit it for some reason? I can’t find any mention of Anthropic having filed a patent on the idea, but maybe other companies would feel too much like it looked like they were second-rate imitators if they said they were copying Anthropic’s idea?
Just speculating, I don’t know. Sure seems like a useful idea to copy.
I’m increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I’ve previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
Low expressed interest in metaphilosophy (in relation to either AI or humans)
Low expressed interest in AI philosophical competence (either concern that it might be low, or desire/excitement for supercompetent AI philosophers with Jupiter-sized brains)
Low concern that philosophical difficulty will be a blocker of AI alignment or cause of AI risk
High confidence when proposing novel solutions (even to controversial age-old questions, and when the proposed solution fails to convince many)
Rarely attacking one’s own ideas (in a serious or sustained way) or changing one’s mind based on others’ arguments
Rarely arguing for uncertainty/confusion (i.e., that that’s the appropriate epistemic status on a topic), with normative ethics being a sometime exception
Possible explanations:
General human overconfidence
People who have a high estimate of difficulty of philosophy self-selecting out of the profession.
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
“Signal group membership” may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
Here are some axes along which I think there’s some group membership signaling in philosophy (IDK about the extent and it’s hard to disentangle it from other stuff):
Math: platonism/intuitionism/computationalism (i.e. what is math?), interpretations of probability, foundations of math (set theory vs univalent foundations)
Mind: externalism/internalism (about whatever), consciousness (de-facto-dualisms (e.g. Chalmers) vs reductive realism vs illusionism), language of thought vs 4E cognition, determinism vs compatibilism vs voluntarism
Metaphysics/ontology: are chairs, minds, and galaxies real? (this is somewhat value-laden for many people)
Biology: gene’s-eye-view/modern synthesis vs extended evolutionary synthesis
I don’t think this is accurate; I think most philosophy is done under motivated reasoning, but it is not straightforwardly about signaling group membership.
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident, but that is more stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare https://link.springer.com/article/10.1007/s43681-023-00379-1.
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they became convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning or better intellectual tools in general? (Or what’s the closest to this that you’re aware of?)
Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know 1 or 2 private expressions of the former, but not translated into action yet.)
Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
How often have you seen philosophers say something like “Upon further reflection, my proposed solution to problem X has many problems/issues, I’m no longer confident it’s the right approach and now think X is much harder than I originally thought.”
Would also appreciate any links/citations/quotes (if personal but sharable communications) on these.
These are all things I’ve said or done due to high estimate of philosophical difficulty, but not (or rarely) seen among academic philosophers, at least from my casual observation from outside academia. It’s also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that for example you don’t think philosophers should often say or do these things), which would also be interesting to know.
Another academic philosopher, directed here by @Simon Goldstein. Hello Wei!
It’s not common to switch entirely to metaphilosophy, but I think lots of us get more interested in the foundations and methodology of at least our chosen subfields as we gain experience, see where progress is(n’t) being made, start noticing deep disagreements about the quality of different kinds of work, and so on. It seems fair to describe this as awakening to a need for better tools and a greater understanding of methods. I recently wrote a paper about the methodology of one of my research areas, philosophy of mathematical practice, for pretty much these reasons.
Current LLMs are pretty awful at discussing the recent philosophy literature, so I think anyone who’d like AI tools to serve as useful research assistants would be happy to see at least some improvement here! I’m personally also excited about the prospects of using language models with bigger context windows for better corpus analysis work in empirical and practice-oriented parts of philosophy.
I basically agree with Simon on this.
I don’t think this is uncommon. You might not see these reversals in print often, because nobody wants to publish and few people want to read a paper that just says “I retract my previous claims and no longer have a confident positive view to offer”. But my sense is that philosophers often give up on projects because the problems are piling up and they no longer see an appealing way forward. Sometimes this happens more publicly. Hilary Putnam, one of the most influential philosophers of the later 20th century, was famous for changing his mind about scientific realism and other basic metaphysical issues. Wesley Salmon gave up his influential “mark transmission” account of causal explanation due to counterexamples raised by Kitcher (as you can read here). It would be easy enough to find more examples.
Great questions. Sadly, I don’t have any really good answers for you.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at the difficulty of finding answers to questions in normative ethics.
I do not, except for the end of Superintelligence
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.
Sadly, I don’t have any really good answers for you.
Thanks, it’s actually very interesting and important information.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at the difficulty of finding answers to questions in normative ethics.
I’ve noticed (and stated in the OP) that normative ethics seems to be an exception where it’s common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because e.g. we’ve been trying to solve them for centuries without coming up with broadly convincing solutions), and there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn’t seem to be the case.
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring out which, because if “calling for an AI pause”, what is the opportunity cost (it seems easy enough to write or sign an open letter), and if “AI pause itself”, “seems useless” contradicts “would love”. In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views of this on LW (or their preferred venue), and share your own views?
FTR I’d probably be up for helping out logistically with such an open letter (e.g. making the website and any other parts of it). I previously made this open letter.
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.
I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy
Thanks for this info and the references. I guess by “metaphilosophy” I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there’s a preferred term for the concept I’m pointing at.) My reasoning being that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen.
I think you (and other philosophers) may be too certain that a pause won’t happen, but I’m not sure I can convince you (at least not easily). What about calling for it in a low cost way, e.g., instead of doing something high profile like an open letter (with perceived high opportunity costs), just write a blog post or even a tweet saying that you wish for an AI pause, because …? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if by keeping silent, you’re helping to keep society in a highly suboptimal equilibrium?
I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the “opportunity cost” issue, but I’m not sure if you’re open to that kind of argument. I think they should have some weight given moral uncertainty.
Hm. I think modern academic philosophy is a raging shitshow, but I thought philosophy on LW was quite good. I wasn’t a regular LW user until a couple of years ago, and the philosophical takes here, particularly Eliezer’s, converge with my own conclusions after a half lifetime of looking at philosophical questions through the lens of science, particularly neuroscience and psychology.
So: what do you see as the limitations in LW/Yudkowskian philosophy? Perhaps I’ve overlooked them.
I am currently skeptical that we need better philosophy for good AGI outcomes, vs. better practical work on technical AGI alignment (a category that barely exists) and PR work to put the likely personal intent aligned AGI into the hands of people that give half a crap about understanding or implementing ethics. Deciding on the long term future will be a matter of a long contemplation if we get AGI into good hands. We should decide if that logic is right, and if so, plan the victory party after we’ve won the war.
I did read your metaphilosophy post and remain unconvinced that there’s something big the rest of us are missing.
I’m happy to be corrected (I love becoming less wrong, and I’m aware of many of my biases that might prevent it):
Here’s how it currently looks to me: Ethics are ultimately a matter of preference, the rest is game theory and science (including the science of human preferences). Philosophical questions boil down to scientific questions in most cases, so epistemology is metaphilosophy for the most part.
Change my mind! Seriously, I’ll listen. It’s been years since I’ve thought about philosophy hard.
I was just reading Daniel Dennett’s memoir for no reason in particular, it had some interesting glimpses into how professional philosophers actually practice philosophy. Like I guess there’s a thing where one person reads their paper (word-for-word!) and then someone else is the designated criticizer? I forget the details. Extremely different from my experience in physics academia though!!
(Obviously, reading that memoir is probably not the most time-efficient way to learn about the day-to-day practice of academic philosophy.)
(Oh, there was another funny anecdote in the memoir where the American professional philosopher association basically had a consensus against some school of philosophy, and everyone was putting it behind them and moving on, but then there was a rebellion where the people who still liked that school of philosophy did a hostile takeover of the association’s leadership!)
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
A non-ethics example that jumps to my mind is David Chalmers on the Hard Problem of Consciousness here: “So if I’m giving my overall credences, I’m going to give, 10% to illusionism, 30% to panpsychism, 30% to dualism, and maybe the other 30% to, I don’t know what else could be true, but maybe there’s something else out there.” That’s the only example I can think of but I read very very little philosophy.
You can’t solve philosophy without solving epistemology. And you can’t solve epistemology because of the Problem of the Criterion, which is pretty much the same as the Münchhausen Trilemma.
“Moreover, its [philosophy’s] central tool is intuition, and this displays a near-total ignorance of how brains work. As Michael Vassar observes, philosophers are “spectacularly bad” at understanding that their intuitions are generated by cognitive algorithms.”—Rob Bensinger, Philosophy, a diseased discipline.
What’s the problem?
It’s not that philosophers weirdly and unreasonably prefer intuition to empirical facts and mathematical/logical reasoning, it is that those things either don’t go far enough, or are themselves based on intuition.
“Just use empiricism” doesn’t work, because philosophy is about interpreting empirical data.
“Just use maths/logic” doesn’t work, because those things are based on axioms justified by intuitive appeal.
“Just use reductionism” doesn’t work, because it’s not clear what lies at the bottom of the stack, or if anything does. Logic, epistemology and ontology have been held to be First Philosophy at different times. Logic, epistemology and ontology also seem to interact. Correct ontology depends on correct epistemology, but what minds are capable of knowing depends on ontology. Logic possibly depends on ontology too, since quantum mechanics arguably challenges traditional bivalent logic.
Philosophers don’t embrace intuitions because they think they are particularly reliable, but because they have reasoned that they can’t do without them. (At least, the other options allowed by the Münchhausen trilemma, circularity and regress, are distinctly unattractive.) That is the essence of the Inconvenient Ineradicability of Intuition. An unfounded foundation is what philosophers mean by “intuition”, not a supernatural insight that could not have been produced by a material brain.
I blame science, math, engineering, entrepreneurship. Philosophy is the practice of the esoteric method, meaning it can’t be made truly legible for very long stretches of investigation. This results in accumulation of anti-epistemic hazards, which science doesn’t particularly need to have tools for dealing with, because it can filter its reasoning through frequent transitions into legibility.
Philosophy can’t rely on such filtering through legibility, it has to maintain sanity the hard way. But as philosophy enviously looks at the more successful endeavors of science, it doesn’t see respect for such methods of maintaining sanity in its reasoning, instead it sees that merely moving fast and breaking things works very well. And so the enthusiasm for their development wanes, instead philosophy remains content with the object level questions that investigate particular truths, rather than methods for getting better at telling which cognitive algorithms can more robustly arrive at truths (rationality, metaphilosophy).
I genuinely don’t know what you want elaboration of. Reacts are nice for what they are, but saying something out loud about what you want to hear more about / what’s confusing / what you did and didn’t understand/agree with, is more helpful.
Re/ “to whom not...”, I’m asking Wei: what groups of people would not be described by the list of 6 “underestimating the difficulty of philosophy” things? It seems to me that broadly, EAs and “AI alignment” people tend to favor somewhat too concrete touchpoints like “well, suppressing revolts in the past has gone like such and such, so we should try to do similar for AGI”. And broadly they don’t credit an abstract argument about why something won’t work, or would only work given substantial further philosophical insight.
Re/ “don’t think thinking …”, well, if I say “LLMs basically don’t think”, they’re like “sure it does, I can keep prompting it and it says more things, and I can even put that in a scaffold” or “what concrete behavior can you point to that it can’t do”. Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is, isn’t clear, but That thinking is should be presumed, pending a forceful philosophical conceptual replacement!
That is, in fact, a helpful elaboration! When you said
Most people who “work on AI alignment” don’t even think that thinking is a thing.
my leading hypotheses for what you could mean were:
Using thought, as a tool, has not occurred to most such people
Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.
Now, instead, my leading hypothesis is that you mean:
Most such people are failing to notice that there’s an important process, called “thinking”, which humans do but LLMs “basically” don’t do.
This is a bunch more precise! For one, it mentions AIs at all.
As my reacts hopefully implied, this is exactly the kind of clarification I needed—thanks!
Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is, isn’t clear, but That thinking is should be presumed, pending a forceful philosophical conceptual replacement!
Sure, but you’re not preaching to the choir at that point. So surely the next step in that particular dance is to stick a knife in the crack and twist?
That is -
“OK, buddy:
Here’s property P (and if you’re good, Q and R and...) that [would have to]/[is/are obviously natural and desirable to]/[is/are pretty clearly a critical part if you want to] characterize ‘thought’ or ‘reasoning’ as distinct from whatever it is LLMs do when they read their own notes as part of a new prompt and keep chewing them up and spitting the result back as part of the new prompt for itself to read.
Here’s thing T (and if you’re good, U and V and...) that an LLM cannot actually do, even in principle, which would be trivially easy for (say) an uploaded (and sane, functional, reasonably intelligent) human H to do, even if H is denied (almost?) all of their previously consolidated memories and is just working from some basic procedural memory and whatever Magical thing this ‘thinking’/‘reasoning’ thing is.”
And if neither you nor anyone else can do either of those things… maybe it’s time to give up and say that this ‘thinking’/‘reasoning’ thing is just philosophically confused? I don’t think that that’s where we’re headed, but I find it important to explicitly acknowledge the possibility; I don’t deal in more than one epiphenomenon at a time and I’m partial to Platonism already. So if this ‘reasoning’ thing isn’t meaningfully distinguishable in some observable way from what LLMs do, why shouldn’t I simply give in?
I’ve had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I’ve yet to get a positive answer (or find someone expressing this sentiment independently):
Among my first reactions upon hearing “artificial superintelligence” were “I can finally get answers to my favorite philosophical problems” followed by “How do I make sure the ASI actually answers them correctly?”
Anyone else reacted like this?
This aside, there are some people around LW/rationality who seem more cautious/modest/self-critical about proposing new philosophical solutions, like MIRI’s former Agent Foundations team, but perhaps partly as a result of that, they’re now out of a job!
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done “so much more, much more intently, and much sooner”?
The type of fundamental problem that proper speculative philosophy is supposed to solve is the sort where streetlighting doesn’t work (or isn’t working, or isn’t working fast enough). But nearly all of the alignment field after like 2004 was still basically streetlighting. It was maybe a reasonable thing to have some hope in prospectively, but retrospectively it was too much investment in streetlighting, and retrospectively I can make arguments about why one should have maybe guessed that at the time. By 2018 IIRC, or certainly by 2019, I was vociferously arguing for that in AF team meetings—but the rest of the team either disagreed with me or didn’t understand me, and on my own I’m just not that good a thinker, and I didn’t find anyone else to try it with. I think they have good thoughts, but are nevertheless mostly streetlighting—i.e. not trying to take step after step of thinking at the level of speculative philosophy AND aimed at getting the understanding needed for alignment.
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn’t convince them (or someone else) to join you. To me this doesn’t clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)
If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to “see in the dark” to pick out a more promising search area or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?
(Also perfectly fine to stop here if you want. I’m pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)
Ok, so, there’s this thing about AGI killing everyone. And there’s this idea of avoiding that by making AGI that’s useful like an AGI but doesn’t kill everyone and does stuff we like. And you say you’re working on that, or want to work on that. And what you’re doing day to day is {some math thing, some programming thing, something about decision theory, …}. What is the connection between these things?
and then you listen to what they say, and reask the question and interrogate their answers, IME what it very often grounds out into is something like:
Well, I don’t know what to do to make aligned AI. But it seems like X ∈ {ontology, decision, preference function, NN latent space, logical uncertainty, reasoning under uncertainty, training procedures, negotiation, coordination, interoperability, planning, …} is somehow relevant.
And, I have a formalized version of some small aspect of X which is mathematically interesting / philosophically intriguing / amenable to testing with a program, and which seems like it’s kinda related to X writ large. So what I’m going to do, is I’m going to tinker with this formalized version for a week/month/year, and then I’m going to zoom out and think about how this relates to X, and what I have and haven’t learned, and so on.
This is a good strategy because this is how all mathematical / scientific / technological progress is made: you start with stuff you know; you expand outwards by following veins of interest, tractability, and generality/power; you keep an eye roughly towards broader goals by selecting the broad region you’re in; and you build outward. What we see historically is that this process tends to lead us to think about the central / key / important / difficult / general problems—such problems show up everywhere, so we convergently will come to address them in due time. By mostly sticking, in our day-to-day work, to things that are relatively more concrete and tractable—though continually pushing and building toward difficult things—we make forward progress, sharpen our skills, and become familiar with the landscape of concepts and questions.
So I would summarize that position as endorsing streetlighting, in a very broad sense that encompasses most math / science / technology. And this position is largely correct! My claim is that
1. this is probably too slow for making Friendly AI, and
2. maybe one could go faster by trying to more directly cleave to the core philosophical problems.
(But note that, while that essay frames things as “a proposed solution”, the solution is barely anything—more like a few guesses at pieces of methodology—and the main point is the discussion of the problem; maybe a writing mistake.)
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
[2.] maybe one could go faster by trying to more directly cleave to the core philosophical problems.
...
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
OK but what would this even look like?
Toss away anything amenable to testing and direct empirical analysis; it’s all too concrete and model-dependent.
Toss away mathsy proofsy approaches; they’re all too formalized and over-rigid and can only prove things from starting assumptions we haven’t got yet and maybe won’t think of in time.
Toss away basically all settled philosophy, too; if there were answers to be had there rather than a few passages which ask correct questions, the Vienna Circle would have solved alignment for us.
What’s left? And what causes it to hang together? And what causes it not to vanish up its own ungrounded self-reference?
What makes you think there are any such ‘answers’, renderable in a form that you could identify?
And even if they do exist, why do you think a human being could fully grasp the explanation in finite time?
Edit: It seems quite possible that even the simplest such ‘answers’ could require many years of full time effort to understand, putting it beyond most if not all human memory capacity. i.e. By the end even those who ‘learned’ it will have forgotten many parts near the beginning.
(Upvoted since your questions seem reasonable and I’m not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI’s arguments. (The sketch after this list illustrates what mechanical checking already looks like in the mathematical case.)
We increase our own intelligence and philosophical competence until we can verify the ASI’s reasoning ourselves.
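For what it’s worth, here is a toy illustration (in Lean 4, chosen arbitrarily) of what the first option is leaning on: in mathematics, an argument can be reduced to an artifact that a dumb mechanical checker verifies without having to trust its author. The analogous artifact for philosophical reasoning is exactly what we don’t yet have.

```lean
-- A trivially machine-checkable proof: the checker accepts or rejects it
-- without having to trust whoever (or whatever) produced it.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun h => ⟨h.2, h.1⟩
```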
Having finally experienced the LW author moderation system firsthand by being banned from an author’s posts, I want to make two arguments against it that may have been overlooked: the heavy psychological cost inflicted on a commenter like me, and a structural reason why the site admins are likely to underweight this harm and its downstream consequences.
(Edit: To prevent a possible misunderstanding, this is not meant to be a complaint about Tsvi, but about the LW system. I understand that he was just doing what he thought the LW system expected him to do. I’m actually kind of grateful to Tsvi to let me understand viscerally what it feels like to be in this situation.)
First, the experience of being moderated by an opponent in a debate inflicts at least the following negative feelings:
Unfairness. The author is not a neutral arbiter; they are a participant in the conflict. Their decision to moderate is inherently tied to their desire to defend their argument and protect their ego and status. In a fundamentally symmetric disagreement, the system places you at a profound disadvantage for reasons having nothing to do with the immediate situation. To a first approximation, they are as likely as you to be biased, so why do they get to be the judge?
Confusion. Consider the commenters who are also authors and manage their own threads through engagement, patience, tolerance, and a healthy dose of self-doubt. They rarely feel a need or desire to go beyond argumentation and voting (edit: at least on a platform like LW with mods pre-filtering users for suitability), so when they are deleted or banned, it creates a sense of bewilderment as to what they could have possibly done to deserve it.
Alienation. The feeling of being powerless to change the system, because so few people are like you, even in a community of people closest to you on Earth in ways of thinking. That you’re on an alien planet, or a mistake theorist surrounded by conflict theorists, with disengagement and self-imposed exile as the only ways out.
Second, this cost and its consequences are perhaps systematically underestimated because the admins are structurally immune to it. An author would almost never ban an admin, meaning they never (or rarely, perhaps on other platforms) experience these feelings. They, being authors themselves, see the author’s pain firsthand, but the commenter’s feelings are merely an abstract report at most. This seems like a source of bias that becomes obvious once pointed out, but doesn’t appear to have been made explicit before, at least as far as I’ve seen.
I don’t want to relitigate the policy in a balanced way at this point, but simply to introduce some potentially new considerations. So, admins, no need to respond now, but please keep these points in mind if you do decide to rethink the policy at some point.
It feels like there’s a confusion of different informal social systems with how LW 2.0 has been set up. Forums have traditionally had moderators distinct from posters, and even when moderators also participate in discussions on small forums, there are often informal conventions that a moderator should not put on a modhat if they are already participating in a dispute as a poster, and a second moderator should look at the post instead (you need more than one moderator for this of course).
The LW 2.0 author moderation system is what blog hosting platforms like Blogger and Substack use, and the bid seems to have been to entice people who got big enough to run their standalone successful blog back to LessWrong. On these platforms the site administrators are very hands-off and usually only drop in to squash something actually illegal (and good luck getting anyone to talk to if they actually decide your blog needs to be wiped from the system), and the separate blogs are kept very distinct from each other with little shared site identity, so random very weird Blogger blogs don’t really create that much of an overall “there’s something off with Blogger” vibe. They just exist on their own domain and mostly don’t interact with the rest of the platform.
Meanwhile, LW is still very much in the forum mold, the posts exist in the same big pool and site moderators are very hands-on, give warnings and can be talked to. Standalone blog author tier people mostly don’t seem to have come back to post a large volume of LW threads, and the dynamics are still very forum-like, so basically now there’s just the chaotic extra element that any random person who started a forum thread can act as moderator and moderate other users as well as their individual comments on their threads, and this adds weird drama and dysfunction to the forum social dynamic. Most of the time it happens it’ll also violate the informal rule that a moderator should not start moderating the dispute they themselves got initially involved in as a non-modhat poster.
EDIT: The third system mixed in is Facebook/Twitter style social media that’s a “steppe” instead of a “valley”, meaning that you have a steady stream of complete strangers coming in and out instead of a pool of a few dozen to a few hundred people who might have been around for over a decade. You want a very low friction ban mechanism on a steppe site since a lot of first interactions will be bad and usually indicate the drive-by stranger they’re from is not worth interacting with. On a valley site the person interacting with you is much more likely to be tightly invested in the very local area, so blocking them is bigger drama generator.
This seems a good opportunity to let you know about an ongoing debate over the LW moderation system. rsaarelm’s comment above provides a particularly sharp diagnosis of the problem that many LWers see: author moderation imposes a “personal blog” moderation system onto a site that functions as a community forum, creating confusion, conflict, and dysfunction because the social norms of the two models are fundamentally at odds.
Even the site’s own admins seem confused. Despite their defense of the “blog” moderation model at every turn, the recently redesigned front-page Feed gives users no indication that by replying to a comment or post, they would be stepping into different “private spaces” with different moderators and moderation policies. It is instead fully forum-like.
Given the current confusions, we may be at a crossroads where LW can either push fully into the “personal blog” model, or officially revert back to the “forum” model that is still apparent from elements of the site’s design, and has plenty of mind share among the LW user base.
I suspect that when you made the original request for author moderation powers, it was out of intuitive personal preference. The site admins initially agreed to your request to entice you back to posting more on LW, but over the years developed a range of justifications for the system (that honestly appear to me more like rationalizations to support the original decision).
This history may be causing much of the current difficulties, because the admins may (perhaps subconsciously) worry that if they fully reevaluated the decision, it could lead to a repudiation of the system, which would necessitate going back on a commitment made to you. Therefore a reassessment from you, based in part on what we have learned living with the current “hybrid” moderation system over the past 8 years, could be invaluable in prompting the admins to also reconsider the system without the historical baggage.
To be clear, I think the above paragraph has a <50% chance of happening, and I may well be totally off in my inference of how the current system and its justifications came into being, but it seems worth checking with you, just in case.
It’s indeed the case that I haven’t been attracted back to LW by the moderation options that I hoped might accomplish that. Even dealing with Twitter feels better than dealing with LW comments, where people are putting more effort into more complicated misinterpretations and getting more visibly upvoted in a way that feels worse. The last time I wanted to post something that felt like it belonged on LW, I would have only done that if it’d had Twitter’s options for turning off commenting entirely.
So yes, I suppose that people could go ahead and make this decision without me. I haven’t been using my moderation powers to delete the elaborate-misinterpretation comments because it does not feel like the system is set up to make that seem like a sympathetic decision to the audience, and does waste the effort of the people who perhaps imagine themselves to be dutiful commentators.
because it does not feel like the system is set up to make that seem like a sympathetic decision to the audience
Curious whether you have any guesses on what would make it seem like a sympathetic decision to the audience. My model here is that this is largely not really a technical problem, but more of a social problem (which is e.g. better worked towards by things like me writing widely read posts on moderation), though I still like trying to solve social problems with better technical solutions and am curious whether you have ideas (that are not “turn off commenting entirely”, which I do think is a bad idea for LW in particular).
I’m not sure what Eliezer is referring to, but my guess is that many of the comments that he would mark as “elaborate-misinterpretations”, I would regard as reasonable questions / responses, and I would indeed frown on Eliezer just deleting them. (Though also shrug, since the rules are that authors can delete whatever comments they want.)
Some examples that come to mind are this discussion with Buck and this discussion with Matthew Barnett, in which (to my reading of things) Eliezer seems to be weirdly missing what the other person is saying at least as much as they are missing what he is saying.
From the frustration Eliezer expressed in those threads, I would guess that he would call these elaborate misinterpretations.
My take is that there’s some kind of weird fuckyness about communicating about some of these topics where both sides feel exasperation that the other side is apparently obstinately mishearing them. I would indeed think it would be worse if the post author in posts like that just deleted the offending comments.
I currently doubt the Buck thread would qualify as such from Eliezer’s perspective (and agree with you there that in as much as Eliezer disagrees, he is wrong in that case).
IMO I do think it’s a pretty bad mark on LW’s reputation that posts like Matthew’s keep getting upvoted, with what seem to me like quite aggressively obtuse adversarial interpretations of what people are saying.
The existence of the latter unfortunately makes the former much harder to navigate.
I’m guessing that there are enough people like me, who have such a strong prior on “a moderator shouldn’t mod their own threads, just like a judge shouldn’t judge cases involving themselves”, plus our own experiences showing that the alternative of forum-like moderation works well enough, that it’s impossible to overcome this via abstract argumentation. I think you’d need to present some kind of evidence that it really leads to better results than the best available alternative.
I’m guessing that there are a lot of people like me, who have such a strong prior on “a moderator shouldn’t mod their own threads, just like a judge shouldn’t judge cases involving themselves”
Nowhere on the whole wide internet works like that! Clearly the vast majority of people do not think that authors shouldn’t moderate their own threads. Practically nowhere on the internet do you even have the option for anything else.
Nowhere on the whole wide internet works like that! Clearly the vast majority of people do not think that authors shouldn’t moderate their own threads. Practically nowhere on the internet do you even have the option for anything else.
Where’s this coming from all of a sudden? Forums work like this, Less Wrong used to work like this. Data Secrets Lox still works like this. Most subreddits work like this. This whole thread is about how maybe the places that work like this have the right idea, so it’s a bit late in the game to open up with “they don’t exist and aren’t a thing anyone wants”.
Yes, Reddit is one of the last places on the internet where this is semi-common, but even there, most subreddits are moderated by people who are active posters, and there are no strong norms against moderators moderating responses to their own comments or posts.
I agree I overstated here and that there are some places on the internet where this is common practice, but it’s really a very small fraction of the internet these days. You might bemoan this as a fate of the internet, but it’s just really not how most of the world thinks content moderation works.
There is actually a significant difference between “Nowhere on the whole wide internet works like that!” and “few places work like that”. It’s not just a nitpick, because to support my point that it will be hard for Eliezer to get social legitimacy for freely exercising author mod power, I just need there to be a not-too-tiny group of people on the Internet who still prefer to have no author moderation (it can be small in absolute numbers, as long as it’s not near zero, since they’re likely to congregate at a place like LW that values rationality and epistemics). The fact that there are still even a few places on the Internet that work like this makes a big difference to how plausible my claim is.
I mean, I think no, if truly there is only a relatively small fraction of people like that around, we as the moderators can just ask those people to leave. Like, it’s fine if we have to ask hundreds of people to leave, the world is wide and big. If most of the internet is on board with not having this specific stipulation, then there is a viable LessWrong that doesn’t have those people.
No, I don’t “need” to do that. This is (approximately) my forum. If anything you “need” to present some kind of evidence that bridges the gap here! If you don’t like it build your own forum that is similarly good or go to a place where someone has built a forum that does whatever you want here.
The point of the post is not to convince everyone, there was never any chance of that, it’s to build enough shared understanding that people understand the principles of the space and can choose to participate or leave.
Ok I misunderstood your intentions for writing such posts. Given my new understanding, will you eventually move to banning or censoring people for expressing disapproval of what they perceive as bad or unfair moderation, even in their own “spaces”? I think if you don’t, then not enough people will voluntarily leave or self-censor such expressions of disapproval to get the kind of social legitimacy that Eliezer and you desire, but if you do, I think you’ll trigger an even bigger legitimacy problem because there won’t be enough buy-in for such bans/censorship among the LW stakeholders.
If you don’t like it build your own forum that is similarly good or go to a place where someone has built a forum that does whatever you want here.
This is a terrible idea given the economy of scale in such forums.
Given my new understanding, will you eventually move to banning or censoring people for expressing disapproval of what they perceive as bad or unfair moderation, even in their own “spaces”?
I mean, I had a whole section in the Said post about how I do think it’s a dick move to try to socially censure people for using any moderation tools. If someone keeps trying to create social punishment for people doing that, then yeah, I will ask them to please do that somewhere else but here, or more likely, leave the content up but reduce the degree to which things like the frontpage algorithm feed attention to it. I don’t know how else any norms on the site are supposed to bottom out.
Top-level posts like this one seem totally fine. Like, if someone wants to be like “I am not trying to force some kind of social punishment on anyone, but I do think there is a relevant consideration here, but I also understand this has been litigated a bunch and I am not planning to currently reopen that”, then that’s fine. Of course you did kind of reopen it, which to be clear I think is fine on the margin, but yeah, I would totally ask you to stop if you did that again and again.
I mean, I had a whole section in the Said post about how I do think it’s a dick move to try to socially censure people for using any moderation tools.
I think an issue you’ll face is that few people will “try to socially censure people for using any moderation tools”, but instead different people will express disapproval of different instances of perceived bad moderation. This adds up to a large enough share of all author moderation getting disapproved of (or worse, blowing up into big dramas) that authors like Eliezer don’t feel there’s enough social legitimacy to really use these tools.
(Like in this case I’m not following the whole site and trying to censure anyone who does author moderation, but speaking up because I myself got banned!)
And Eliezer’s comment hints why this would happen: the comments he wants to delete are often highly upvoted. If you delete such comments, and the mod isn’t a neutral third party, of course a lot of people will feel it was wrong/unfair and want to express disapproval, but they probably won’t be the same people each time.
How are you going to censor or deprioritize such expressions of disapproval? By manual mod intervention? AI automation? Instead of going to that trouble and causing a constant stream of resentment from people feeling wronged and silenced, it seems better for Eliezer to just mark the comments that misinterpret him as misinterpretations (maybe through the react system or a more prominent variation of it, if he doesn’t want to reply to each one and say “this is a misinterpretation”). One idea is that reacts from the OP author are distinguished or displayed more prominently somehow.
I think an issue you’ll face is that few people will “try to socially censure people for using any moderation tools”,
No, my guess is this is roughly the issue. I think the vast majority of complaints here tend to be centered in a relatively small group of people who really care.
It’s not a particularly common expectation that people have about how the internet works, as I have said in other places in this thread. I don’t think the rest of the internet gets these kinds of things right, but I also don’t think that there will be an unquenchable torrent of continuous complaints that will create a landscape of perpetual punishment for anyone trying to use moderation tools.
I think if you resolve a few disagreements, and moderate a relatively small number of people, you end up at an equilibrium that seems a bunch saner to me.
The rest of the Internet is also not about rationality though. If Eliezer started deleting a lot of highly upvoted comments questioning/criticizing him (even if based on misinterpretations like Eliezer thinks), I bet there will be plenty of people making posts like “look at how biased Eliezer is being here, trying to hide criticism from others!” These posts themselves will get upvoted quite easily, so this will be a cheap/easy way to get karma/status, as well as (maybe subconsciously) getting back at Eliezer for the perceived injustice.
I don’t know if Eliezer is still following this thread or not, but I’m also curious why he thinks there isn’t enough social legitimacy to exercise his mod powers freely, and whether it’s due to a similar kind of expectation.
I mean, yes, these dynamics have caused many people, including myself, to want to leave LessWrong. It sucks. I wish people stopped. Not all moderation is censorship. The fact that it universally gets treated as such by a certain population of LW commenters is one of the worst aspects of this site (and one of the top reasons why in the absence of my own intervention into reviving the site, this site would likely no longer exist at all today).
I think we can fix it! I think it unfortunately takes a long time, and continuous management and moderation to slowly build trust that indeed you can moderate things without suddenly everyone going insane. Maybe there are also better technical solutions.
Claiming this is about “rationality” feels like mostly a weird rhetorical move. I don’t think it’s rational to pretend that unmoderated discussion spaces somehow outperform moderated ones. As has been pointed out many times, 4Chan is not the pinnacle of internet discussion. Indeed, I think largely across the internet, more moderation results in higher trust and higher quality discussions (not universally, you can definitely go on a censorious banning spree as a moderator and try to skew consensus in various crazy ways, but by and large, as a correlation).
This is indeed an observation so core to LessWrong that Well-Kept Gardens Die By Pacifism was, as far as I can tell, a post necessary for LessWrong to exist at all.
I’m not saying this, nor are the hypothetical people in my prediction saying this.
Claiming this is about “rationality” feels like mostly a weird rhetorical move.
We are saying that there is an obvious conflict of interest when an author removes a highly upvoted piece of criticism. Humans being biased when presented with COIs is common sense, so connecting such author moderation with rationality is natural, not a weird rhetorical move.
The rest of your comment seems to be forgetting that I’m only complaining about authors having COI when it comes to moderation, not about all moderation in general. E.g. I have occasional complaints like about banning Said, but generally approve of the job site moderators are doing on LW. Or if you’re not forgetting this, then I’m not getting your point. E.g.
I don’t think it’s rational to pretend that unmoderated discussion spaces somehow outperform moderated ones.
I have no idea how this relates to my actual complaint.
We are saying that there is an obvious conflict of interest when an author removes a highly upvoted piece of criticism. Humans being biased when presented with COIs is common sense, so connecting such author moderation with rationality is natural, not a weird rhetorical move.
Look, we’ve had these conversations.
I am saying the people who are moderating the spaces have the obvious information advantage about their own preferences and about what it’s actually like to engage with an interlocutor, plus the motivation advantage to actually deal with it. “It’s common sense that the best decisions get made by people with skin in the game and who are most involved with the actual consequences of the relevant decision”. And “it’s common sense that CEOs of organizations make hiring and firing decisions for the people they work with, boards don’t make good firing decisions, the same applies to forums and moderators”.
This is a discussion as old as time in business and governance and whatever. Framing your position as “common sense” is indeed just a rhetorical move, and I have no problem framing the opposite position in just as much of an “obvious” fashion. Turns out, neither position obviously dominates by common sense! Smart people exist on both sides of this debate. I am not against having it again, and I have my own takes on it, but please don’t try to frame this as some kind of foregone conclusion in which you have the high ground.
The rest of your comment seems to be forgetting that I’m only complaining about authors having COI when it comes to moderation, not about all moderation in general.
I was (and largely am) modeling you as being generically opposed to basically any non-spam bans or deletions on the site. Indeed, as I think we’ve discussed, the kind of positions that you express in this thread suggest to me that you should be more opposed to site-wide bans than author bans (since site-wide bans truly make countervailing perspectives harder to find, instead of driving them from the comment sections to top-level posts).
If you aren’t against site-wide bans, I do think that’s a pretty different situation! I certainly didn’t feel like I was empowered to moderate more in our conversations on moderation over the last year. It seemed to me you wanted both less individual author moderation, and less admin moderation for anything that isn’t spam. Indeed, I am pretty sure, though I can’t find it, that you said that LW moderation really should only establish a very basic level of protection against spam and basic norms of discourse, but shouldn’t do much beyond that, but I might be misremembering.
If you do support moderation, I would be curious about you DMing me some example of users you think we should ban, or non-spam comments we should delete. My current model of you doesn’t really think those exist.
I think you’re right that I shouldn’t have latched onto the first analogy I thought of. Here’s a list of 11 (for transparency, analogies 3-10 were generated by Gemini 3.0 Pro, though some may have appeared in previous discussions):
The CEO & The Corporation
The Judge & The Courtroom
The Dinner Party Host
The University Classroom / Professor
The Conference Breakout Session
Open Source / GitHub Maintainer
The Stand-Up Comedian & The Heckler
The Art Gallery Opening
Graffiti on a Private House
The Town Hall vs Private Meetings
The Hypothetical HOA
I decided to put detailed analysis of these analogies in this collapsed section, as despite extensive changes by me from the original AI-generated text, it doesn’t quite read like my style. Also, it might be too much text and my summary/conclusions below may be sufficient to convey the main points.
1. The CEO & The Corporation
Analogy: A Forum Post is a “Project.” The Author is the CEO; the Commenter is an Employee. The CEO needs the power to fire employees who disrupt the vision, and the Board (Admins) should defer to the CEO’s judgment.
Disanalogy: In a corporation, the Board cannot see daily operations, creating information asymmetry; on a forum, Admins see the exact same content as the Author. A CEO has a smaller conflict of interest when firing an employee, because they are judged primarily by the company’s financial performance rather than the perception of their ideas. If they fire an employee who makes a good criticism, they might subsequently look better to others, but the company’s performance will suffer.
Conclusion: The analogy fails because the Author lacks the financial alignment of a CEO and possesses no special private information that the Admins lack.
2. The Judge & The Courtroom
Analogy: When there is a conflict in the physical world, we find disinterested parties to make enforceable judgments, even if the cost is very high. When the cost is too high, we either bear it (wait forever for a trial date) or give up the possibility of justice or enforcement, rather than allow an interested party to make such judgments.
Disanalogy: A courtroom has the power of Coercion (forcing the loser to pay, go to jail, or stop doing something). A Forum Author only has the power of Dissociation (refusing to host the commenter’s words). We require neutral judges to deprive people of rights/property; we do not require neutral judges to decide who we associate with.
Conclusion: Dissociation has its own externalities (e.g., hiding of potentially valuable criticism), which we usually regulate via social pressure, or legitimize via social approval, but you don’t want this and therefore need another source of legitimacy.
3. The Dinner Party Host
Analogy: A Post is a private social gathering. The Author is the Host. The Host can kick out a guest for any reason, such as to curate the conversation to his taste.
Disanalogy: In the real world, if a Host kicks out a guest that everyone else likes, the other attendees would disapprove and often express such disapproval. There is no mechanism to then suppress such disapproval, of the kind you are seeking.
Conclusion: You want the power of the Host without the social accountability that naturally regulates a Host’s behavior.
4. The University Classroom / Professor
Analogy: The Author is a Subject Matter Expert (Professor). The Commenter is a Student. The Dean (Admin) lets the Professor silence students to prevent wasting class time.
Disanalogy: A classroom has a “scarce microphone” (only one person can speak at a time); a forum has threaded comments (parallel discussions), so the “Student” isn’t stopping the “Professor” from teaching. Additionally, LessWrong participants are often peers, not Student/Teacher.
Conclusion: The justification for silencing students (scarcity of time/attention, asymmetry of expertise) does not apply to LW.
5. The Conference Breakout Session
Analogy: The Author is like an Organizer who “rented the room” at a convention. The Organizer has the right to eject anyone to accomplish his goals.
Disanalogy: Just like the Dinner Party, an Organizer would almost never eject someone who is popular with their table. If they did, the table would likely revolt.
Conclusion: This analogy fails to justify the action of overriding the local consensus (upvotes) of the participants in that sub-thread.
6. Open Source / GitHub Maintainer
Analogy: A Post is a Code Repository. A Comment is a Pull Request. The Maintainer has the absolute right to close a Pull Request as “Wontfix” or “Off Topic” to keep the project focused.
Disanalogy: In Open Source, a rejected Pull Request is Closed, not Deleted. The history remains visible, easy to find, and auditable. Also, this situation is similar to the CEO in that the maintainer is primarily judged on how well their project works, with the “battle of ideas” aspect a secondary consideration.
Conclusion: You are asking for more power for an Author than a Maintainer, and a Maintainer has less COI for reasons similar to a CEO.
7. The Stand-Up Comedian & The Heckler
Analogy: The Author is a Comedian. The Commenter is a Heckler. Even if the Heckler is funny (Upvoted), they are stealing the show. The Club (Admins) protects the Comedian because writing a set is high-effort.
Disanalogy: In a physical club, the Heckler interrupts the show. In a text forum, the comment sits below the post. The audience can consume the Author’s “set” without interference before reading the comment.
Conclusion: The physical constraints that justify silencing a heckler do not exist in a digital text format.
8. The Art Gallery Opening
Analogy: The Post is a Painting. The Upvoted Comment is a Critic framing the art negatively. The Artist removes the Critic to preserve the intended Context of the work.
Disanalogy: Art is about aesthetics and subjective experience. LessWrong is ostensibly about intellectual progress and truth-seeking.
Conclusion: Prioritizing “Context” over “Criticism” serves goals that are not LW’s.
9. Graffiti on a Private House
Analogy: A Post is the Author’s House. A Comment is graffiti. The homeowner has the right to scrub the wall (Delete) so neighbors don’t see it.
Disanalogy: This is purely about property value and aesthetics.
Conclusion: Again the goals are too different for the analogy to work.
10. The Town Hall vs Private Meetings
Analogy: In the real world we have both town halls (Neutral Moderator) and meetings in private houses (Author Control). We can have both.
Disanalogy: Even in the discussions inside a private house, social norms usually prevent a host from kicking out a guest who is making popular points that everyone else agrees with.
Conclusion: The social legitimacy that you seek doesn’t exist here either.
11. The Hypothetical HOA
Analogy: A hypothetical residential community with HOA rules that say a homeowner not only has the right to kick out any guest during meetings/parties, but also that no one is allowed to express disapproval when such powers are exercised. Anyone who buys a house in the community is required to sign the HOA agreement.
Disanalogy: There are already many people in the LW community who never “signed” such agreements.
Conclusion: You are proposing to ask many (“hundreds”) of the existing “homeowners” (some of whom have invested years of FTE work into site participation) to leave, which is implausible in this hypothetical analogy.
Overall Conclusions
None of the analogies are perfect, but we can see some patterns when considering them together.
Neutral, disinterested judgement is a standard social technology for gaining legitimacy. In the case of courts, it is used to legitimize coercion, an otherwise illegitimate activity that would trigger much opposition. In the case of a forum, it can be used to legitimize (or partly legitimize) removing/hiding/deprioritizing popular/upvoted critiques.
Some analogies provide a potential new idea for gaining such legitimacy in some cases: relatively strong and short external feedback loops like financial performance (for the CEO) and real-world functionality (for the open source maintainer) can legitimize greater unilateral discretion. This can potentially work on certain types of posts, but most lack such short-term feedback.
In other cases, suppression of dissent is legitimized for specific reasons clearly not applicable to LW, such as clear asymmetry of expertise between speaker and audience, or physical constraints.
In the remaining cases, the equivalent of author moderation (e.g., kicking out a houseguest) is legitimized only by social approval, but this is exactly what you and Eliezer want to avoid.
Having gone through all of these possible analogies, I think my intuition for judges/courts being the closest analogy to moderation is correct after all: in both cases, disinterested judgement seems to be the best or only way to gain social legitimacy for unpopular decisions.
However, this exercise also made me realize that in most of the real world we do allow people to unilaterally exercise the power of dissociation, as long as it’s regulated by social approval or disapproval, and this may be a reasonable prior for LW.
Perhaps the strongest argument (for my most preferred policy of no author moderation, period) at this point is that unlike the real world, we lack clear boundaries to signal when we are entering a “private space”, nor is it clear how much power/responsibility the authors are supposed to have, with the site mods also being around. The result is a high cost of background confusion (having to track different people’s moderation policies/styles or failing to do so and being surprised) as well as high probability of drama/distraction whenever it is used, because people disagree or are confused about the relevant norms.
On the potential benefits side, the biggest public benefits of moderation can only appear when it’s against the social consensus, otherwise karma voting would suffice as a kind of moderation. But in this case clearly social approval can’t be a source of legitimacy, and if disinterested judgment and external feedback are also unavailable as sources of legitimacy, then it’s hard to see what can work. (Perhaps worth reemphasizing here, I think this intuitive withholding of legitimacy is correct, due to the high chance of abuse when none of these mechanisms are available.) This leaves the private psychological benefit to the author, which is something I can’t directly discuss (due to not having a psychology that wants to “hard” moderate others), and can only counter with the kind of psychological cost to author-commenters like myself, as described in the OP.
@Ben Pace I’m surprised that you’re surprised. Where did your impression that I generally disapprove of the job site moderators are doing on LW come from, if you can recall?
In the last year I’d guess you’ve written over ten thousand words complaining about LW moderation over dozens of comments, and I don’t recall you ever saying anything positive about the moderation? I recall you once saying that you won’t leave the site over our actions (so far), which sounds like you’ll bear our moderation, but that is quite different from saying it’s overall good.
Thanks, to clarify some more in case it’s helpful, I think I’ve only complained about 2 things, the Said banning and the author moderation policy, and the word count was just from a lot of back and forth, not the number of issues I’ve had with the mod team? A lot of what you do is just invisible to me, like the user pre-filtering that habryka mentioned and the routine moderation work, but I assume you’re doing a good job on them, as I’m pretty happy with the general LW environment as far as lack of spam, generally good user behavior, and not seeing many complaints about being unfairly moderated by the mod team, etc.
Found my quote about not leaving:
My response to this is that I don’t trust people to garden their own space, along with other reasons to dislike the ban system. I’m not going to leave LW over it though, but just be annoyed and disappointed at humanity whenever I’m reminded of it.
Yeah I think you misinterpreted it. I was just trying to say that unlike those who got what they wanted (the author mod policy) by leaving or threatening to leave, I’m explicitly not using this threat as a way to get what I want. It was a way to claim the moral high ground I guess. Too bad the message misfired.
rsaarelm gave an excellent explanation early on about how the issue seems to be an incompatibility between forum mechanics and blog mechanics, rather than an issue with moderation itself. It would be unfortunate if the point was overlooked because it was misunderstood as “moderation is bad”.
It is fair to say that a blog with a policy “I’ll moderate however I like, if you don’t like it leave” works fine. It’s the default and implicit.
When it comes to a forum system with as many potential posters as there are commenters, “If you don’t like it leave” is the implicit ultimatum from every single user to every other. But if the feed system that governs content exposure doesn’t allow leaving individual posters, then the only thing that can be left is the entire forum.
This is why all other significant sites with a many producers → many consumers model all have unsubscribe, mute and/or block features. It helps ensure a few weeds in the Well-Kept Garden don’t drive away all the plants with low toxin tolerance.
It sounds like—particularly from testimony from habryka and Eliezer—moving to a more meta-blog-like system is/was critical to lesswrong being viable. Which means leaning in to that structure and fully implementing the requisite features seems like an easy way to improve the experience of everyone.
I think you’d need to present some kind of evidence that it really leads to better results than the best available alternative.
I am perhaps misreading, but think this sentence should be interpreted as “if you want to convince [the kind of people that I’m talking about], then you should do [X, Y, Z].” Not “I unconditionally demand that you do [X, Y, Z].”
This comment seems like a too-rude response to someone who (it seems to me) is politely expressing and discussing potential problems. The rudeness seems accentuated by the object level topic.
Curious whether you have any guesses on what would make it seem like a sympathetic decision to the audience.
Off-the-cuff idea, probably a bad one:
Stopping short of “turning off commenting entirely”, being able to make comments to a given post subject to a separate stage of filtering/white-listing. The white-listing criteria are set by the author and made public. Ideally, the system is also not controlled by the author directly, but by someone the author expects to be competent at adhering to those criteria (perhaps an LLM, if they’re competent enough at this point).
The system takes direct power out of the author’s hands. They still control the system’s parameters, but there’s a degree of separation now. The author is not engaging in “direct” acts of “tyranny”.
It’s made clear to readers that the comments under a given post have been subject to additional selection, whose level of bias they can estimate by reading the white-listing criteria.
The white-listing criteria are public. Depending on what they are, they can be (a) clearly sympathetic, (b) principled-sounding enough to decrease the impression of ad-hoc acts of tyranny even further.
(Also, ideally, the system doing the selection doesn’t care about what the author wants beyond what they specified in the criteria, and is thus an only boundedly and transparently biased arbiter.)
The commenters are clearly made aware that there’s no guarantee their comments on this post will be accepted, so if they decide to spend time writing them, they know what they’re getting into (vs. bitterness-inducing sequence where someone spends time on a high-effort comment that then gets deleted).
There’s no perceived obligation to respond to comments the author doesn’t want to respond to, because they’re rejected (and ideally the author isn’t even given the chance to read them).
There are no “deleting a highly-upvoted comment” events with terrible optics.
Probably this is still too censorship-y, though? (And obviously doesn’t solve the problem where people make top-level takedown posts in which all the blacklisted criticism is put and then highly upvoted. Though maybe that’s not going to be as bad and widespread as one might fear.)
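For concreteness, here is a minimal sketch of how the white-listing stage described above might be wired up, assuming the judge is an LLM (or any system the author doesn’t steer case by case) applying the author’s published criteria. All names below are hypothetical, and the keyword check is only a runnable stand-in for an actual LLM call; treat this as an illustration of the idea, not a proposed implementation.

```python
# Hypothetical sketch only: none of these names correspond to real LessWrong
# code. The "judge" here is a naive keyword matcher standing in for an LLM
# that applies the author's published white-listing criteria.
from dataclasses import dataclass


@dataclass
class Verdict:
    accepted: bool
    reason: str  # surfaced to the commenter, so rejections aren't opaque


def judge_comment(public_criteria: list[str], comment: str) -> Verdict:
    """Accept iff the comment matches one of the published criteria.

    In the actual proposal this would be an LLM prompt of the form
    "given these public criteria, does this comment satisfy them?";
    keyword matching is used here only to keep the sketch self-contained.
    """
    for criterion in public_criteria:
        if criterion.lower() in comment.lower():
            return Verdict(True, f"matches published criterion: {criterion!r}")
    return Verdict(False, "does not match any published criterion")


def submit_comment(public_criteria: list[str], comment: str) -> Verdict:
    verdict = judge_comment(public_criteria, comment)
    if verdict.accepted:
        # Readers would see a notice that comments on this post are filtered.
        print("published")
    else:
        # The commenter learns why up front; ideally the author never sees it.
        print(f"rejected: {verdict.reason}")
    return verdict


if __name__ == "__main__":
    criteria = ["object-level disagreement", "factual correction"]
    submit_comment(criteria, "A factual correction: the 2019 figure is off by 10x.")
```

The key property is the degree of separation: the author sets the criteria in public, but the per-comment accept/reject decision is made by something the author doesn’t control directly.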
This history may be causing many of the current difficulties, because the admins may (perhaps subconsciously) worry that if they fully reevaluated the decision, it could lead to a repudiation of the system, which would necessitate going back on a commitment made to you.
Look, I know I wouldn’t want to use LessWrong if I couldn’t ban annoying commenters from my posts. I wouldn’t run LessWrong if I didn’t have the ability to ban annoying commenters from the site.
Yes, if there were trusted enough moderators who would do the job well-enough for me, then I could get away without moderation tools when posting, but currently, if I was an external party, I would not trust even the current LW team with a copy of myself to do this. The LW team is trying to allow a bunch of people with incompatible conversation styles to be on the site, and this means that I cannot expect site-wide bans and deletion to be sufficient for making it so I don’t have run-ins with a bunch of people who will very predictably waste hours of my time.
Like, de-facto we try to compensate a bunch for this by asking people what kind of moderation they would like on their post, and then try to help them with moderation, but it’s a labor intensive process that mostly fails for bandwidth reasons.
And look, it sucks. You show up with a top-level thread where you say:
I don’t want to relitigate the policy in a balanced way at this point, but simply to introduce some potentially new considerations. So, admins, no need to respond now, but please keep these points in mind if you do decide to rethink the policy at some point.
And then you just fully ignore this intention, and tag a bunch of people to please weigh in on a dispute, absolutely asking me to respond. Like, sure, you can open up this thread again, and every time anyone takes any moderation decision ever on the site, but please at least don’t say that you are not going to do it, and then totally do it.
I wouldn’t run LessWrong if I didn’t have the ability to ban annoying commenters from the site.
I’m totally fine with site moderators moderating in a disinterested way. It’s the “moderators modding their own thread” thing that I’m objecting to. Kind of strange that you’re still writing as if you don’t get this.
And then you just fully ignore this intention, and tag a bunch of people to please weigh in on a dispute, absolutely asking me to respond. Like, sure, you can open up this thread again, and every time anyone takes any moderation decision ever on the site, but please at least don’t say that you are not going to do it, and then totally do it.
Yeah, I realize I’m being inconsistent. That was my original intention, but @rsaarelm subsequently came in with a really clear and concise description of the core problem, and it seemed like too good of an opportunity not to pivot a bit. (I don’t think I tagged anyone besides Eliezer though?) Of course I don’t blame you for wanting to respond in this situation.
I’m totally fine with site moderators moderating in a disinterested way. It’s the “moderators modding their own thread” thing that I’m objecting to. Kind of strange that you’re still writing as if you don’t get this.
I mean, that’s what my previous sentence was about. The logical structure was implicitly “Look, I know I wouldn’t want to use LessWrong if I couldn’t ban annoying commenters from my posts because I also wouldn’t run LessWrong if I didn’t have the ability to ban annoying commenters from the site.”
And then I continued explaining that trusting site-moderators to handle every case for me is a tricky task that I don’t expect site moderators are capable of. Indeed, it’s not even clear a copy of me could handle it, because global governance is indeed just different from local governance, and it’s very hard to wear both hats at the same time.
Yeah, I realize I’m being inconsistent
Cool, if we are on the same page that there was a shift here, I feel better about this. I agree there was an opportunity there (though I disagree that rsaarelm’s post was a particularly good summary of the situation, but it seems fine for you to believe that).
I think it kind of sucks that this did indeed invalidate my ability to trust that threads like this aren’t threads I have to follow in great detail and manage in costly ways, which then makes discussions about this kind of stuff worse in the future.
Even the site’s own admins seem confused. Despite defending the “blog” moderation model at every turn, the recently redesigned front-page Feed gives users no indication that by replying to a comment or post, they would be stepping into different “private spaces” with different moderators and moderation policies. It is instead fully forum-like.
The intended UI here is the same as what happens on current blogposts, where these guidelines show up at the bottom of the commenting box. It’s a bug that it doesn’t show up, and my guess is we’ll fix it this week.
Like, I think this is a valuable piece of information, but definitely not one I would consider crucial. If you go to Twitter[1], or Facebook or any part of the rest of the internet, there is a universal assumption that the author of the post you are commenting on has moderation privileges on that post. People understand that a feed is a conglomerate of content from lots of different places. “No indication” is just false, it’s about the same level of indication as the whole rest of the internet has.
In the case of Twitter, an author can hide any replies made by anyone else under their tweet, which is not quite like deleting, but close enough for the purpose of this discussion.
“Fundamentally at odds” seems way too strong to me, so I assume that I’m missing something.
Even the site’s own admins seem confused. Despite defending the “blog” moderation model at every turn, the recently redesigned front-page Feed gives users no indication that by replying to a comment or post, they would be stepping into different “private spaces” with different moderators and moderation policies. It is instead fully forum-like.
In practice, Authors utilize their moderation privileges so rarely that there seems to be no difference whatsoever between the user experience in the “full-forum model” vs. the “private spaces” model? Like the difference has never bothered me, or impacted my behavior at all?
And indeed, I think that authors rarely use their moderation privileges because the private spaces on LessWrong are built on top of a platform that runs on the forum model. Authors don’t need to aggressively moderate their posts, because the LessWrong mod team does a lot of the work that would otherwise need to be done by the authors. In practice, we’re mostly relying on the forum model, but with an extra, rarely invoked, layer of the “private spaces” model for handling some exceptional cases. Which overall seems to work just fine to me?
Also, the author’s moderation policies are displayed just below the text box every time you post a comment? That seems importantly different than “no indication”, unless I’m misunderstanding what you mean.
Overall, I don’t get why this feels like such a big deal to you, yet.
“Fundamentally at odds” seems way too strong to me, so I assume that I’m missing something.
What I mean is that on forums there is an expectation that mods will be fair to all sides, and this is in part achieved by the rule of not modding their own threads. If you feel like a mod abused their power (e.g. deleted content that didn’t violate the rules) you can often appeal to the other mods.
On a personal blog such expectations do not exist. If you get modded however unfairly, you just suck it up and move on. Since these expectations are totally opposite, when you mix the two models together on LW it becomes very confusing what one should expect.
It could be that “fundamentally at odds” is worded too strongly though. Let me know what you think given the above.
In practice, Authors utilize their moderation privileges so rarely that there seems to be no difference whatsoever between the user experience in the “full-forum model” vs. the “private spaces” model? Like the difference has never bothered me, or impacted my behavior at all?
Right, but the site admins are trying to encourage people to use it more, so I thought I’d report my own experience of being author-banned, as a warning to them. I also think if they’re not used more, then the author mod powers should just be removed, to fix the above mentioned confusion, which is there regardless of how much the powers are actually used.
Also, the author’s moderation policies are displayed just below the text box every time you post a comment? That seems importantly different than “no indication”, unless I’m misunderstanding what you mean.
I’m talking specifically about the comment boxes in the Feed section of the front page, which do not have such policies displayed, as of this writing.
The LW 2.0 author moderation system is what blog hosting platforms like Blogger and Substack use, and the bid seems to have been to entice people who got big enough to run their standalone successful blog back to Lesswrong.
I think it was also a desire to get people who liked a steppe style system to post. In particular, I recall Eliezer saying that he wanted a system similar to his Facebook page, where he can just ban an annoying commenter with a couple of clicks and be done with it.
Thank you, this seems like a very clear and insightful description of what is confusing and dysfunctional about the current situation.
To add some of my personal thoughts on this, the fact that the Internet always had traditional forums with the forum model of moderation shows that model can work perfectly well, and there is no need for LW to also have author moderation, from a pure moderation (as opposed to attracting authors) perspective. And “standalone blog author tier people” not having come back in 8 years since author mod was implemented means it’s time to give up on that hope.
LW is supposed to be a place for rationality, and the forum model of moderation is clearly better for that (by not allowing authors to quash/discourage disagreement or criticism). “A moderator shouldn’t mod their own threads” is such an obviously good rule and widely implemented on forums, that sigh… I guess I’ll stop here before I start imputing impure motives to the site admins again, or restart a debate I don’t really want to have at this point.
I stopped by lesswrong for the first time in a decade and (due to the familiar author) this was the first post that caught my attention in my feed. I’m shocked.
The new policy:
* Allows post authors to suppress any disagreement or correction at will.
* STILL does not allow blocking users.
* Force-feeds these private and immune-from-criticism posts to all participants in a collective feed, with no way to opt out, mute, or block any egregious abusers.
This is a feature combination that isn’t seen on any successful site. For good reason. As others have mentioned, sites where authors control participation always rely on readers being able to opt in, then opt out if the author deviates from good-faith contribution (in the reader’s view). Forums where you cannot opt out do not allow conflicts of interest in moderation. Anyone moderating on their own disagreements is (and should be) viewed with suspicion.
Honestly, even the refusal to allow users to block and mute each other has always been a toxic trait of lesswrong. But being force fed posts and comments from bad faith debaters with no ability to refute, block or mute would make for a site even worse than x.com, at least in terms of technology for facilitating healthy conversation.
It should have been immediately obvious that as soon as posts become author controlled spaces, readers must be able to choose which authors they follow. How was this able to happen? Were there no adults in the room?
hi, just registering that Tsvi warned me before using mod tools recently, I updated, I was not banned, and it seemed fine. I generally think your reaction to this is reasonable and understandable, I was stressed by getting warned, but I tried to react in a way that would result in not getting banned. I think that the effect on how I write is not going to be that I censor myself, just that I think more before commenting on some people’s posts.
I do think there’s such a thing as giving a moderator too much power over a situation. but I also want to register that the noise any time this happens is frustrating and leads me to want features related to muting drama-related topics.
that said, I’m interested in what you think of bluesky’s model, where blocking someone hides their replies from your threads, but nobody can delete each others’ replies, so the replies are still out there and can and do turn into separate discussions when the people involved in the reply are still interested in them. in my opinion it’s a bit too strong because it means gentle disagreement across factions is still hard to have happen but it produces strong incentive for disagreement across factions to be cordial, which I think has mildly increased sanity of the people on bluesky (but… phew, there’s a long way to go and the bluesky population’s view on AI seems pretty insane in a way that seems immune to evidence), but I like it in many ways; it’s weaker than the LW/facebook model where you can just ban anyone.
To a first approximation, they are as likely as you to be biased, so why do they get to be the judge?
I think the answer to this is, “because the post, specifically, is the author’s private space”. So they get to decide how to conduct discussion there (for reference, I always set moderation to Easy Going on mine, but I can see a point even to Reign of Terror if the topic is spicy enough). The free space for responses and rebuttals isn’t supposed to be the comments of the post, but the ability to write a different post in reply.
I do agree that in general if it comes to that—authors banning each other from comments and answering just via new posts—then maybe things have already gotten a bit too far into “internet drama” land and everyone could use some cooling down. And it’s generally probably easier to keep discussions on a post in the comments of the post. But I don’t think the principle is inherently unfair; you have the same exact rights as the other person and can always respond symmetrically, that’s fairness.
The free space for responses and rebuttals isn’t supposed to be the comments of the post, but the ability to write a different post in reply.
I want to just note, for the sake of the hypothesis space, a probably-useless idea: There could somehow be more affordance for a middle ground of “offshoot” posting. In other words, structurally formalize / enable the pattern that Anna exhibited in her comment here:
And the ensuing discussion seemed productive. This is kinda like quote-tweeting as opposed to replying. The difference from just making your own shortform post would be that it’s a shortform post, but also paired with a comment on the original post. This would be useful if, as in the above example, the OP author asked for a topic to be discussed in a different venue; or if a commenter wants to discuss something, and also notify the author, and also make their comment visible to other people reading the comments on the OP, but wants to have their own venue or wants to avoid taking up attention in the OP because the comment is off-topic or for whatever other reason.
That’s not a bad idea. You could link something like “this post is a reply to X” and then people could explore “threads” of posts that are all rebuttals and arguments surrounding a single specific topic. Doesn’t even need to be about things that have gotten this hostile, sometimes you just want to write a full post because it’s more organic than a comment.
Right. I realized later that I framed this as something the commenter decides; it would also be possible to have this sort of thing replace authors deleting comments or banning users. The author could press the “boot” button, and then this boots the comment out of the comment section. But it doesn’t delete it, it just moves all discussion to wherever the comment was booted to (e.g. open thread or quick take or something). Maybe it also hides most of the comment, and shows a single response from the author. (Not especially advocating for this.)
I think the answer to this is, “because the post, specifically, is the author’s private space”.
I think that’s the official explanation, but even the site admins don’t take it seriously. Because if this is supposed to be true, then why am I allowed to write and post replies directly from the front page Feed, where all the posts and comments from different authors are mixed together, and authors’ moderation policies are not shown anywhere? Can you, looking at that UI, infer that those posts and comments actually belong to different “private spaces” with different moderators and moderation policies?
This is indeed a notable design flaw. In principle, I think it allows one to get a highly upvoted top comment on a post which entirely misunderstands or doesn’t have anything to do with the top-level post.
Can I, looking at that UI, see how to get others’ “private spaces” out of my brain? The core mechanic of reading the site appears to be non-consensual exposure to others’ private (sometimes) nonsense.
@habryka I talked about why I almost never delete or ban here. I guess that comment talks about a “commitment” to not do this, but I’ve internalized the reason enough that I also just don’t feel a need or desire for it.
I understand that you don’t! But almost everyone else who I do think has those attributes does not have those criteria. Like, Scott Alexander routinely bans people from ACX, even Said bans people from datasecretslox. I am also confident that the only reason why you would not ban people here on LW, is because the moderators are toiling for like 2 hours a day to filter out the people obviously ill-suited for LessWrong.
Sidechannel note: Said wishes it to be known that he neither bans people from DSL nor customarily has the right to, the task being delegated to moderators rather than the sysop. ( https://share.obormot.net/textfiles/MINHjLX7 )
Sure! I was including “setting up a system that bans other people” in my definition here. I am not that familiar with how DSL works, but given that it bans people, and it was set up by Said, felt confident that thereby somehow Said chose to build a system that does ban people.
Though if Said opposes DSL banning people (and he thinks the moderators are making a mistake when doing so) then I would want to be corrected!
I think he just objected to the phrasing. I do think “set up a system where people can be banned by others whom Said does not instruct on who to ban” is a stretch for “Said bans people from DSL.”
I have generally found Said to mean the things he says quite literally and to expect others to do so as well. It’s painful to read a conversation where one person keeps assigning subtext to another who quite clearly never intended to put it there.
Another reason for not wanting to moderate is that I’ve never regretted not moderating on LW, nor can I recall any instances where some discussion of my post might have gone much better if I had. For example, sometimes a discussion goes off in a direction that I’m not interested in, but I just ignore it and let it develop however the participants want. Or if someone is being dumb or does something wrong, I just downvote it and ignore it (maybe give an explanation if I feel like it). I can’t recall anything particularly bad happening as a result of these decisions.
If the benefits of author moderation are as low as they seem to me from my own experience, I just can’t imagine it being worth the costs. Am I just very lucky, or what?
Did Scott or Said ban people on LW when they were here? If not, then I would amend that part to say that on a platform like LW with moderators doing a reasonable job pre-filtering people, people like me don’t feel a need or desire to ban. Which doesn’t seem to change my point much. If yes, then it would appear that I overgeneralized from my own example.
I don’t know about Scott. Him being personally active on the site was long before my tenure as admin, and I am not even fully sure how moderation or deletion at the time worked.
I don’t think Said ever banned anyone, though he also wrote only a very small number of top-level posts, so there wasn’t much opportunity. My guess is he wouldn’t have even if he had been writing a lot of top-level posts.
More substantively, I think my feelings and policies are fundamentally based on a (near) symmetry between the author and commenter. If they are both basically LW users in good standing, why should the author get so much more power in a conflict/disagreement?[1] So this doesn’t apply to moderating/filtering out users who are just unsuitable for LW or one’s own site.
I mean I understand you have your reasons, but it doesn’t remove the unfairness. Like if in a lawsuit for some reason a disinterested judge can’t be found, and the only option is to let a friend of the plaintiff be the judge, that “reason” is not going to remove the unfairness.
They, being authors themselves, see the author’s pain firsthand, but the commenter’s feelings are merely an abstract report at most.
I do think there is a bunch of truth to this, but I am active on many other forums, and have e.g. been issued moderation warnings on the EA Forum, so I do experience moderation in other contexts (and of course get blocked on Twitter from time to time). Also, I… think authors are not that much less likely to ban moderators from their posts than other users. Of the maybe 30 users who have ever been banned from other users’ posts, one of them is a moderator:
If Eliezer were more active on the site, my guess is people would be a bunch more likely to ban him from their posts than they would other people for the same behavior. In general, in my experience, tallest-poppy dynamics are stronger in the rationality community than leadership-deference dynamics.
What’s your preferred policy alternative? How could it be abused? What are your contingencies for when it inevitably does get abused? (Like, from what I recall about LW1.0’s history, lack of content moderation and protection from spammers was something that supposedly almost killed the site, or something.)
I’ve talked about that elsewhere and don’t want to get into it again. Probably most of it was here but I’m not totally sure. Edit: oh actually you’re talking about a different issue, which does not actually seem to be a problem that the author moderation system is supposed to address, as I’ve almost never seen spammers on my own posts or anyone else’s.
What do people think about having more AI features on LW? (Any existing plans for this?) For example:
AI summary of a poster’s profile, that answers “what should I know about this person before I reply to them”, including things like their background, positions on major LW-relevant issues, distinctive ideas, etc., extracted from their post/comment history and/or bio links.
“Explain this passage/comment” based on context and related posts, similar to X’s “explain this tweet” feature, which I’ve often found useful.
“Critique this draft post/comment.” Am I making any obvious mistakes or clearly misunderstanding something? (I’ve been doing a lot of this manually, using AI chatbots.)
“What might X think about this?”
Have a way to quickly copy all of someone’s posts/comments into the clipboard, or download as a file (to paste into an external AI).
I’ve been thinking about doing some of this myself (e.g., update my old script for loading all of someone’s post/comment history into one page), but of course would like to see official implementations, if that seems like a good idea.
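For what it’s worth, item 5 may not need to wait for an official feature. LessWrong exposes a public GraphQL endpoint, so a small script can page through a user’s history and save it to a file. A rough sketch is below; the query and field names are my guesses at the ForumMagnum schema and should be checked against the live schema at lesswrong.com/graphql before relying on them, and the user id is a placeholder.

```python
# Sketch for item 5: page through a user's comments via the public GraphQL
# endpoint and save them oldest-to-newest. Query/field names are assumptions
# about the ForumMagnum schema; verify against the live schema before use.
import json

import requests

ENDPOINT = "https://www.lesswrong.com/graphql"


def build_query(user_id: str, limit: int, offset: int) -> str:
    # Assumed query shape; adjust to whatever the schema actually exposes.
    return f"""
    {{
      comments(input: {{terms: {{view: "userComments", userId: "{user_id}",
                                 limit: {limit}, offset: {offset}}}}}) {{
        results {{ _id postedAt contents {{ markdown }} }}
      }}
    }}
    """


def fetch_all_comments(user_id: str, page_size: int = 200) -> list[dict]:
    """Fetch in pages; stop when a page comes back short."""
    results, offset = [], 0
    while True:
        resp = requests.post(ENDPOINT, json={"query": build_query(user_id, page_size, offset)})
        resp.raise_for_status()
        batch = resp.json()["data"]["comments"]["results"]
        results.extend(batch)
        if len(batch) < page_size:
            return results
        offset += page_size


if __name__ == "__main__":
    comments = fetch_all_comments("exampleUserId")  # placeholder id
    comments.sort(key=lambda c: c["postedAt"])      # oldest to newest
    with open("comments.json", "w") as f:
        json.dump(comments, f, indent=2)
```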
These things can be done well with a browser extension, and this way they can work on other sites too. I recently made some browser extensions and it’s really easy.
That said, I’d rather people didn’t do this. The walls between human beings are already pretty thick. It’s alienating enough to talk with someone by text on the internet, with all the chances of misunderstanding that implies; it’ll be even harder if I know the other person is responding to an AI-rewritten version of my comment, referring to an AI-summarized version of my profile, running AI hypotheticals on how I would react, and so on. The human would be almost invisible behind the layers of colored glass.
it’ll be even harder if I know the other person is responding to an AI-rewritten version of my comment, referring to an AI-summarized version of my profile, running AI hypotheticals on how I would react
I think all of these are better than the likely alternatives though, which are that:
I fail to understand someone’s comment or the reasoning/motivations behind their words, and most likely just move on (instead of asking them to clarify)
I have little idea what their background knowledge/beliefs are when replying to them
I fail to consider some people’s perspectives on some issue
It also seems like I change my mind (or at least become somewhat more sympathetic) more easily when arguing with an AI-representation of someone’s perspective, maybe due to less perceived incentive to prove that I was right all along.
This seems like one-shot reasoning though. If you extend it to more people, the end result is a world where everyone treats understanding people as a chore to be outsourced to AI. To me this is somewhere I don’t want to go; I think a large part of my values are chores that I don’t want to outsource. (And in fact this attitude of mine began quite a few steps before AI, somewhere around smartphones.)
Hmm, I find it hard to understand or appreciate this attitude. I can’t think of any chores that I intrinsically don’t want to outsource, only concerns that I may not be able to trust the results. What are some other examples of chores you do and don’t want to outsource? Do you have any pattern or explanation of where you draw the line? Do you think people who don’t mind outsourcing all their chores are wrong in some way?
There’s no “line” per se. The intuition goes something like this. If my value system is only about receiving stuff from the universe, then the logical endpoint is a kind of blob that just receives stuff and doesn’t even need a brain. But if my value system is about doing stuff myself, then the logical endpoint is Leonardo da Vinci. To me that’s obviously better. So there are quite a lot of skills—like doing math, playing musical instruments, navigating without a map, or understanding people as in your example—that I want to do myself even if there are machines that could do it for me cheaper and better.
If my value system is only about receiving stuff from the universe, then the logical endpoint is a kind of blob that just receives stuff and doesn’t even need a brain.
Unless one of the things you want to receive from the universe is to be like Leonardo da Vinci, or be able to do everything effortlessly and with extreme competence. Why “do chores” now if you can get to that endpoint either way, or maybe even more likely if you don’t “do chores” because it allows you to save on opportunity costs and better deploy your comparative advantage? (I can understand if you enjoy the time spent doing these activities, but by calling them “chores” you seem to be implying that you don’t?)
Well, there’s no point in asking the AI to make me good at things if I’m the kind of person who will just keep asking the AI to do more things for me! That path just leads to the consumer blob again. The only alternative is if I like doing things myself, and in that case why not start now. After all, Leonardo himself wasn’t motivated by the wish to become a polymath, he just liked doing things and did them. Even when they’re a bit difficult (“chores”).
Anyway that was the theoretical argument, but the practical argument is that it’s not what’s being offered now. We started talking about outsourcing the task of understanding people to AI, right? That doesn’t seem like a step toward Leonardo to me! It would make me stop using a pretty important part of my mind. Moreover, it’s being offered by corporations that would love to make me dependent, and that have a bit of history getting people addicted to stuff.
Well, there’s no point in asking the AI to make me good at things if I’m the kind of person who will just keep asking the AI to do more things for me!
But I’m only asking the AI to do things for me because they’re too effortful or costly. If the AI made me good at these things with no extra effort or cost (versus asking the AI to do it) then why wouldn’t I do them myself? For example I’m pretty sure I’d love the experience of playing like a concert pianist, and would ask for this ability, if doing so involved minimal effort and cost.
On the practical side, I agree that atrophy and being addicted/exploited are risks/costs worth keeping in mind, but I’ve generally made tradeoffs more in the direction of using shortcuts to minimize “doing chores” (e.g., buying a GPS for my car as soon as they came out, giving up learning an instrument very early) and haven’t regretted it so far.
(This thread is getting a bit long, and we might not be convincing each other very much, so hope it’s ok if I only reply with points I consider interesting—not just push-pull.)
With the concert pianist thing I think there’s a bit of type error going on. The important skill for a musician isn’t having fast fingers, it’s having something to say. Same as: “I’d like to be able to write like a professional writer”—does that mean anything? You either have things you want to write in the way that you want to write, or there’s no point being a writer at all, much less asking an AI to make you one. With music or painting it’s the same. There’s some amount of technique required, but you need to have something to say, otherwise there’s no point doing it.
So with that in mind, maybe music isn’t the best example in your case. Let’s take an area where you have something to say, like philosophy. Would you be willing to outsource that?
Let’s take an area where you have something to say, like philosophy. Would you be willing to outsource that?
Outsourcing philosophy is the main thing I’ve been trying to do, or trying to figure out how to safely do, for decades at this point. I’ve written about it in various places, including this post and my pinned tweet on X. Quoting from the latter:
Among my first reactions upon hearing “artificial superintelligence” were “I can finally get answers to my favorite philosophical problems” followed by “How do I make sure the ASI actually answers them correctly?”
Aside from wanting to outsource philosophy to ASI, I’d also love to have more humans who could answer these questions for me. I think about this a fair bit and wrote some things down but don’t have any magic bullets.
(I currently think the best bet to eventually getting what I want is to encourage an AI pause along with genetic enhancements for human intelligence, have the enhanced humans solve metaphilosophy and other aspects of AI safety, then outsource the rest of philosophy to ASI, or have the enhanced humans decide what to do at that point.)
BTW I thought this would be a good test for how competent current AIs are at understanding someone’s perspective so I asked a bunch of them how Wei Dai would answer your question, and all of them got it wrong on the first try, except Claude Sonnet 4.5 which got it right on the first try but wrong on the second try. It seems like having my public content in their training data isn’t enough, and finding relevant info from the web and understanding nuance are still challenging for them. (GPT-5 essentially said I’d answer no because I wouldn’t trust current AIs enough, which is really missing the point despite having this whole thread as context.)
Yeah, I wouldn’t have predicted this response either. Maybe it’s a case of something we talked about long ago—that if a person’s “true values” are partly defined by how the person themselves would choose to extrapolate them, then different people can end up on very diverging trajectories. Like, it seems I’m slightly more attached to some aspects of human experience that you don’t care much about, and that affects the endpoint a lot.
I would like to do more work on this kind of stuff, and expect to do so after a current big batch of back-end refactors is done (not commenting on whether we might do any of these specific AI features, but it seems clear that we will want to figure out how to integrate AI into both discussion and content production on LW somehow).
Thanks for letting me know. Is there anything on my list that you don’t think is a good idea or probably won’t implement, in which case I might start working on them myself, e.g. as a userscript? Especially #5, which is also useful for other reasons, like archiving and searching.
I think we are unlikely to do #2 based on my current guesses of what are good ideas. I think #1 is also kind of unlikely. I think some version of 3,4 and 5 are definitely things I want to explore.
Hey, it’s been 6 months. Can I get an updated ETA on 5 please? If it’s going to take much longer, please let me know and I’ll just code up something myself.
We have a bunch of UI that I would need to modify only a tiny bit to get you #5, I think.
If you imagine the UI at lesswrong.com/autocompleteSettings, but with a “copy to clipboard” button at the bottom, and a user search menu at the top (instead of just having Gwern, Eliezer and Scott), would that work for you?
(Note that the list-plus icon button that you get when hovering over a list entry allows you to mark all elements above the item in the list you are hovering over as checked, so you don’t have to click on dozens of comments manually)
Need: A way to load all comments and posts of a user. Right now it only loads the top N by karma.
Want: A “download” button, for some users who have up to hundreds of MB of content, too unwieldy to copy/paste. Ability to collate/sort in various ways, especially as a flat list of mixed posts and comments, sorted by posting date from oldest to newest.
The “load all” and “download” parts are a bit tricky. I don’t super want to give people buttons to download hundreds of megabytes of data from our database, which is at greater risk of causing our DB to fall over and delay other queries people are waiting on. Probably it’s doable, but it makes the feature a bit more complicated.
I’ll give it a quick try tomorrow, or by the end of next week at the latest, and see whether I can get you what you want. I’ll let you know if it turns out to be more than an hour or two of work, in which case I might delay it further.
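On the DB-load worry: one pattern that might help (not a claim about how the LW backend or its limits actually work) is to cap the page size and have the client pause between pages, so an export is a slow trickle of small queries rather than one giant one. A toy illustration, with made-up numbers:

```python
# Toy illustration of a "gentle" export loop: bounded page size plus a pause
# between requests, so no single query is large. `fetch_page(limit, offset)`
# is a stand-in for whatever paged API actually exists; numbers are made up.
import time
from typing import Callable, Dict, List


def gentle_export(fetch_page: Callable[[int, int], List[Dict]],
                  page_size: int = 100,        # cap on rows per query
                  pause_seconds: float = 1.0   # breathing room between queries
                  ) -> List[Dict]:
    items: List[Dict] = []
    offset = 0
    while True:
        batch = fetch_page(page_size, offset)
        items.extend(batch)
        if len(batch) < page_size:
            return items
        offset += page_size
        time.sleep(pause_seconds)
```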
I’m a huge fan, especially for the user-specific, ephemeral uses like you describe. “Summarize the major contrasting views to this post” would be awesome. I’m much less happy with publication and posting-support uses that would be the obvious things to do.
Are humans fundamentally good or evil? (By “evil” I mean something like “willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)” and by “good” I mean “most people won’t do that because they terminally care about others”.) People say “power corrupts”, but why isn’t “power reveals” equally or more true? Looking at some relevant history (people thinking Mao Zedong was sincerely idealistic in his youth, early Chinese Communist Party looked genuine about wanting to learn democracy and freedom from the West, subsequent massive abuses of power by Mao/CCP lasting to today), it’s hard to escape the conclusion that altruism is merely a mask that evolution made humans wear in a context-dependent way, to be discarded when opportune (e.g., when one has secured enough power that altruism is no longer very useful).
After writing the above, I was reminded of @Matthew Barnett’s AI alignment shouldn’t be conflated with AI moral achievement, which is perhaps the closest previous discussion around here. (Also related are my previous writings about “human safety” although they still used the “power corrupts” framing.) Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
At the time, I replied to him, “I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general.” I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough, that as they got richer and more powerful, their altruistic values would have more and more influence. In the few months since then, I’ve become more worried, perhaps due to learning more about Chinese history and politics...
My model is that the concept of “morality” is a fiction which has 4 generators that are real:
People have empathy, which means they intrinsically care about other people (and sufficiently person-like entities), but mostly about those in their social vicinity. Also, different people have different strengths of empathy; a minority might have virtually none.
Superrational cooperation is something that people understand intuitively to some degree. Obviously, a minority of people understand it on a System 2 level as well.
There is something virtue-ethics-like which I find in my own preferences, along the lines of “some things I would prefer not to do, not because of their consequences, but because I don’t want to be the kind of person who would do that”. However, I expect different people to differ in this regard.
The cultural standards of morality, which it might be selfishly beneficial to go along with, including lying to yourself that you’re doing it for non-selfish reasons. Which, as you say, becomes irrelevant once you secure enough power. This is a sort of self-deception which people are intuitively skilled at.
Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
The reason I said “most humans are not evil” is because I honestly don’t think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).
By contrast, I focused on consumer behavior because the majority of the world’s economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.
Focusing on consumer preferences is a natural thing to do if you want to capture intuitively “what humans are doing with their wealth”, at least in our current world. Before focusing on something else by default—such as moral preferences—I’d want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.
You mention one such argument along these lines:
I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I’d like to see one before I abandon my focus on “everyday (e.g. consumer) behavior”.
Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal.
Ok, I guess we just define/use it differently. I think most people we think of as “evil” probably justify inflicting harm to others as instrumental to some “greater good”, or are doing it to gain or maintain power, not because they value it for its own sake. I mean if someone killed thousands of people in order to maintain their grip on power, I think we’d call them “evil” and not just “selfish”?
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer.
I’m pretty sure that billionaires consume much less as percent of their income, compared to the average person. EA funding comes disproportionately from billionaires, AFAIK. I personally spend a lot more time/effort on altruistic causes, compared to if I was poorer. (Not donating much though for a number of reasons.)
For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis?
I’m thinking that we just haven’t reached that inflection point yet, where most people run out of things to spend selfishly on (like many billionaires have, and like I have to a lesser extent). As I mentioned in my reply to your post, a large part of my view comes from not being able to imagine what people would spend selfishly on, if each person “owned” something like a significant fraction of a solar system. Why couldn’t 99% of their selfish desires be met with <1% of their resources? If you had a plausible story you could tell about this, that would probably change my mind a lot. One thing I do worry about is status symbols / positional goods. I tend to view that as a separate issue from “selfish consumption” but maybe you don’t?
I like the insight regarding power corrupting or revealing. I think perhaps both might be true and, if so, we should keep both lines of thought in mind when thinking about these types of questions.
My general view is that most people are generally good when you’re talking about individual interactions. I’m less confident in that when one brings in the in-group/out-group aspects. I’m just not sure how to integrate all that into a general view or principle about human nature.
A line I heard in some cheesy B-grade horror movie relates to this question of human nature and the idea that we all have competing good and bad wolves inside. One of the characters asks which wolf is stronger, the good wolf or the bad wolf. The answer: “Which do you feed the most?”
“willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)”
Yes, obviously.
The vast majority of people would inflict huge amounts of disutility on others if they thought they could get away with it and benefitted from it.
What then prevents humans from being more terrible to each other? If the vast majority of people are like this, and they presumably know that the vast majority of others are also like this, up to common knowledge, I don’t see how you’d get a stable society in which people aren’t usually screwing each other over a giant amount.
I don’t think altruism is evolutionarily connected to power as you describe. Caesar didn’t come to power by being better at altruism, but by being better at coordinating violence. For a more general example, the Greek and other myths don’t give many examples of compassion (though they do show many other human values); it seems the modern form of compassion only appeared with Jesus, which is too recent for any evolutionary explanation.
So it’s possible that the little altruism and other nice things we got are merely lucky memes. Not even a necessary adaptation, but more like a cultural peacock’s tail, which appeared randomly and might fix itself or not, while our fundamental nature remains that of other living creatures, who eat each other without caring much.
I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
Yeah, that seems to agree with my pessimistic view—that we are selfish animals, except we have culture, and some cultures accidentally contain altruism. So the answer to your question “are humans fundamentally good or evil?” is “humans are fundamentally evil, and only accidentally sometimes good”.
I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
I think that there is a version of altruistic pursuit where one will, by default, “reduce his power.” I think this scenario happens when, in the process of attempting to do good, one exposes himself more to unintended consequences. The person who sacrifices will reduce his ability to exercise power, but he may recover or even exceed that loss if the tribe agrees with his rationale for the sacrifice.
Just because it was not among the organizing principles of any of the literate societies before Jesus does not mean it is not part of the human mental architecture.
There are several levels at which humans can be bad or evil:
Doing bad things because they believe them to be good
Doing bad things while not caring whether they are bad or not
Doing bad things because they believe them to be bad (Kant calls this “devilish”)
I guess when humans are bad, they usually do 1). Even Hitler may have genuinely thought he was doing the morally right thing.
Humans also sometimes do 2), for minor things. But rarely if the anticipated bad consequences are substantial. People who consistently act according to 2) are called psychopaths. They have no inherent empathy for other people. Most humans are not psychopathic.
Humans don’t do 3), they don’t act evil for the sake of it. They aren’t devils.
I have a feeling that for many posts that could be posted as either normal posts or as shortform, they would get more karma as shortform, for a few possible reasons:
lower quality bar for upvoting
shortforms showing some of the content, which helps hook people in to click on it
people being more likely to click on or read shortforms due to less perceived effort of reading (since they’re often shorter and less formal)
This seems bad because shortforms don’t allow tagging and are harder to find in other ways. (People are already more reluctant to make regular posts due to more perceived risk if the post isn’t well received, and the above makes it worse.) Assuming I’m right and the site admins don’t endorse this situation, maybe they should reintroduce the old posting karma bonus multiplier, but like 2x instead of 10x, and only for positive karma? Or do something else to address the situation like make the normal posts more prominent or enticing to click on? Perhaps show a few lines of the content and/or display the reading time (so there’s no attention penalty for posting a literally short post as a normal post)?
Some months ago, I suggested that there could be a UI feature, which authors could turn on or off, to automatically turn shortforms into proper posts if they get sufficient karma.
One potential issue is that this makes posting shortforms even more attractive, so you might see everything being initially posted as shortforms (except maybe very long effortposts) since there’s no downside to doing that. I wonder if that’s something the admins want to see.
It seems like the quality of shortform writing that displaces what would otherwise have been full posts will generally be lower. But on the other hand, people might feel more willing to publish at all, because they don’t have to make the assessment of whether or not what they’ve written is good enough to be worth making a bid for other people’s attention.
I want a LessWrong Canny—here’s VRChat’s Canny for comparison. Canny (or similar systems, e.g. “the feature requests post”) is nice because upvoted features have no particular need of being implemented. It also means that the “why users want this” feedback channel is higher bandwidth than just guess-test-and-ask; with a central place for people to comment on features and indicate their preferences, it’s much easier for someone to go see what feature suggestions there are and add their feedback. I have several feature requests I’d add which the LW team has been hesitant about, and I think that if they made an affordance for many users to comment on potential features, it would become clearer which ones are actually wanted by many people and why; and yet it would not force the LW team to implement any one feature, so requests that they’d prefer to reject or satisfy in a different way would still be possible.
So, I formally request that the LW team make a feature requests post and pin it somewhere discoverable but not overly prominent, e.g. the top-right user menu.
A persistent place to look for feedback which creates common knowledge of desired features seems likely to me (70% ish?) to make it obvious that shortform-to-post is one of the top 4 most desired features out of dozens.
There is a sort of upside to this, in that to the extent that people are more inclined to post shortforms than longforms due to the lower perceived/expected effort of the former, there is a possibility of (optional?) UX engineering to make writing longforms feel a bit more like writing shortforms, so that people who have something to write but also have a feeling of “ugh, that would be a lot of effort, I’ll do it when I’m not as tired [or whatever]” would be more inclined to write and post it.
Relatedly, every few days, I find myself writing some long and detailed message in a DM, which I would be less motivated to write in my personal notes, let alone write a blog post about it, and sometimes the message turns out to look like a first draft of a blog post.[1] How to hijack this with UX?[2]
After I started talking about it, I found out that apparently “write an article like a message to an intellectual-peer friend” is something like a folk advice.
I think posts should be displayed more like shortforms so that one isn’t limited to a title to make the argument for reading more. Ideally, it would come with a recommendation for how to put “who should read this post, and when? what do they get out of it?” info at the top of the post, within blurb length limit.
Philosophy as meta problem solving
Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy) we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. [...] Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.
It occurred to me that from the perspective of longtermist differential intellectual progress, it was a bad idea to invent things like logic, mathematical proofs, and scientific methodologies, because it permanently accelerated the wrong things (scientific and technological progress) while giving philosophy only a temporary boost (by empowering the groups that invented those things, which had better than average philosophical competence, to spread their culture/influence). Now we face the rise of China and/or AIs, both of which seem likely (or at least plausibly) to be technologically and scientifically (but not philosophically) competent, perhaps in part as a result of technological/scientific (but not philosophical) competence having been made legible/copyable by earlier philosophers.
If only they’d solved metaphilosophy first, or kept their philosophy of math/science advances secret! (This is of course not entirely serious, in case that’s not clear.)
I’d suggest that trying to understand what values are would potentially have been a better direction to emphasize. Our understanding here is still pre-Socratic, basically pre-cultural.
It seems to me that values have been a main focus of philosophy for a long time, with moral philosophy (or perhaps meta-ethics if the topic is “what values are”) devoted to it and discussed frequently both in academia and out, whereas metaphilosophy has received much less attention. This implies that we know progress on understanding values is probably pretty hard on the current margins, whereas there’s a lot more uncertainty about the difficulty of metaphilosophy. Solving the latter would also be of greater utility, since it makes solving all other philosophical problems easier, not just values. I’m curious about the rationale behind your suggestion.
It seems to me that values have been a main focus of philosophy for a long time
I’m curious about the rationale behind your suggestion.
Specifically the question of “what values are” I don’t think has been addressed (I’ve looked around some, but certainly not thoroughly). A key problem with previous philosophy is that values are extreme in how much they require some sort of mental context (https://www.lesswrong.com/posts/HJ4EHPG5qPbbbk5nK/gemini-modeling). Previous philosophy (that I’m aware of) largely takes the mental context for granted, or only highlights the parts of it that are called into question, or briefly touches on it. This is pretty reasonable if you’re a human talking to humans, because you do probably share most of that mental context. But it fails on two sorts of cases:
trying to think about or grow/construct/shape really alien minds, like AIs;
trying to exert human values in a way that is good but unnatural (think for example of governments, teams, “superhuman devotion to a personal mission”, etc.).
The latter, 2., might have, given more progress, helped us be wiser.
My comment was responding to
it was a bad idea to invent things like logic, mathematical proofs, and scientific methodologies, because it permanently accelerated the wrong things (scientific and technological progress) while giving philosophy only a temporary boost (by empowering the groups that invented those things, which had better than average philosophical competence, to spread their culture/influence).
So I’m saying, in retrospect on the 2.5 millennia of philosophy, it plausibly would have been better to have an “organology, physiology, medicine, and medical enhancement” of values. To say it a different way, we should have been building the conceptual and introspective foundations that would have provided the tools with which we might have been able to become much wiser than is accessible to the lone investigators who periodically arise, try to hack their way a small ways up the mountain, and then die, leaving mostly only superficial transmissions.
whereas metaphilosophy has received much less attention.
I would agree pretty strongly with some version of “metaphilosophy is potentially a very underserved investment opportunity”, though we don’t necessarily agree (because of having “very different tastes” about what metaphilosophy should be, amounting to not even talking about the same thing). I have ranted several times to friends about how philosophy (by which I mean metaphilosophy—under one description, something like “recursive communal yak-shaving aimed at the (human-)canonical”) has barely ever been tried, etc.
I know @Wei Dai’s post isn’t entirely serious, but I want to flag that the position that we could have understood values/philosophy without knowing about math/logic is a fictional world/fabricated option.
It cannot exist, and updatelessness can be taken too far with compute constraints.
I want to flag that the position that we could have understood values/philosophy without knowing about math/logic is a fictional world/fabricated option.
Maybe but I don’t believe that you know this. Lots of important concepts want to be gotten at by routes that don’t use much math or use quite different math from “math to understand computers” or “math to formalize epistemology”. Darwin didn’t need much math to get lots of the core structure of evolution by natural selection on random mutation.
Perhaps more seriously, the philosophers who got a temporary manpower and influence boost from the invention of math and science should have worked much harder to solve metaphilosophy, while they had the advantage.
This contradicts my position in Some Thoughts on Metaphilosophy. What about that post do you find unconvincing, or what is your own argument for “philosophy being insoluble”?
Talk about “values” is very popular on LessWrong, but much less common in philosophy or natural language. I confess I don’t even know what you mean with “trying to understand what values are”. Can you make the problem statement more precise, perhaps without reference to “values”?
Old High rationalism had a world view in which rationality, AI and ethics were all roughly the same thing: the optimisation of utility/value (which are the same thing). Rationality was von Neumann rationality; an AI was a utility-maximising agent; the only possible theory of ethics was utilitarianism, i.e. utility maximisation, i.e. value maximisation.
Philosophers have discussed these under the term “desires”. I think there has been a lot of progress since the time of the pre-Socratics: Aristotle’s practical syllogism, Buridan’s donkey, Hume’s emphasis on the independence of beliefs and desires, Kant’s distinction between theoretical reason and practical reason, direction of fit, Richard Jeffrey’s utility theory (where utilities are degrees of desire), analyses of akrasia by various analytic philosophers, Nozick’s experience machine, and various others.
“A lot of progress”… well, reality doesn’t grade on a curve. Surely someone has said something about something, yes, but have we said enough about what matters? Not even close. If you don’t know how inadequate our understanding of values is, I can’t convince you in a comment, but one way to find out would be to try to solve alignment. E.g. see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html
There is quite the difference between “our understanding is still pre-Socratic” and “we haven’t said enough”. In general I think very few people here (not sure whether this applies to you) are familiar with the philosophical literature on topics in this area. For example, there is very little interest on LessWrong in normative ethics and the associated philosophical research. Even though this is directly related to alignment, since, if you have an intent-aligned ASI (which is probably easier to achieve than shooting straight for value alignment), you probably need to know what ethics it should implement when asking it to create a fully value-aligned ASI.
Interestingly, the situation is quite different for the EA Forum, where there are regular high-quality posts on solving issues in normative ethics with reference to the academic literature, like the repugnant conclusion, the procreation asymmetry and the status of person-affecting theories. Any satisfactory normative ethical theory needs to solve these problems, similar to how any satisfactory normative theory of epistemic rationality needs to solve the various epistemic paradoxes and related issues.
Again, I don’t know whether this applies to you, but most cases of “philosophy has made basically no progress on topic X” seem to come from people who have very little knowledge of the philosophical literature on topic X.
I’m not sure. I did put in some effort to survey various strands of philosophy related to axiology, but not much effort. E.g. looked at some writings in the vein of Anscombe’s study of intention; tried to read D+G because maybe “machines” is the sort of thing I’m asking about (was not useful to me lol); have read some Heidegger; some Nietzsche; some more obscure things like “Care Crosses the River” by Blumenberg; the basics of the “analytical” stuff LWers know (including doing some of my own research on decision theory); etc etc. But in short, no, none of it even addresses the question—and the failure is the sort of failure that was supposed to have its coarsest outlines brought to light by genuinely Socratic questioning, which is why I call it “pre-Socratic”, not to say that “no one since Socrates has billed themselves as talking about something related to values or something”.
I think even communicating the question would take a lot of work, which as I said is part of the problem. A couple hints:
You should think of the question of values as being more like “what is the driving engine” rather than “what are the rules” or “what are the outcomes” or “how to make decisions” etc.
Prestige status is surprisingly useless in domestic life, and dominance status is surprisingly often held by the female side, even in traditional “patriarchal” societies.
Examples: Hu Shih, the foremost intellectual of 1920s China (Columbia PhD, professor of Peking University and later its president), being “afraid” of his illiterate, foot-bound wife and generally deferring to her. Robin Hanson’s wife vetoing his decision to sell stocks ahead of COVID, and generally not trusting him to trade on their shared assets.
Not really sure why or how to think about this, but thought I’d write down this observation… well a couple of thoughts:
Granting or recognizing someone’s prestige may be a highly strategic (albeit often subconscious) decision, not something you just do automatically.
These men could probably win more dominance status in their marriages if they tried hard, but perhaps decided their time and effort was better spent to gain prestige outside. (Reminds me of comparative advantage in international trade, except in this case you can’t actually trade the prestige for dominance.)
The Robin Hanson example doesn’t show that dominance is held by his wife, Peggy Jackson, unless you have tweets from her saying that she decided to trade a lot of stocks, he tried to veto it, and she overruled his veto and did it anyway. They could have a rule where large shared investment decisions are made with the consent of both sides. Some possibilities:
You’re surprised by the absence of male dominance, not the presence of female dominance.
You interpreted a partner-veto as partner-dominance, instead of joint decision-making.
Peggy Jackson is dominant in their relationship but you picked a less compelling example.
This from the same tweet reads as Robin Hanson getting his way in a dispute:
I stocked us up for 2 mo. crisis, though wife resisted, saying she trusted CDC who said 2 wk. is plenty.
You seem to have left out the fact that Robin Hanson is a renowned economics expert and likely has more skill in deciding when to sell stocks than his spouse.
I think modeling the social dynamics of two people in a marriage with status, a high-level abstraction typically applied to groups, doesn’t make much sense. Game theory would make more sense imo.
@Lukas Finnveden I originally heard it in a Chinese language YouTube video about Hu Shih. I tried to look up a reference before making the post, but couldn’t find anything online that was well-sourced so decided not to cite anything, but this is the best that I can find, with English translation below:
AI translated article about Hu Shih’s marriage
The Marriage of Hu Shih and Jiang Dongxiu: One of the Great Oddities of the Republican Era
“You want a divorce? Fine. Since you don’t want us anymore, I’ll kill our two sons first, then kill myself right in front of you!” Jiang Dongxiu said, full of grief and indignation.
Hu Shih looked in horror at the kitchen knife held to the neck of his 5-year-old son, Liwang, and at his 2-year-old son, Sidu, in his wife’s arms. Terrified, he fled. All thoughts of divorce could go to hell.
The woman holding the kitchen knife was Jiang Dongxiu, the arranged-marriage wife of Hu Shih, a leader of the New Culture Movement and a great literary master. Their marriage was known as one of the “Great Oddities of the Republican Era.”
The pairing of a cultural elite with an uneducated village woman was not surprising in itself; Lu Xun, Xu Zhimo, and Yu Dafu all abandoned their first wives. What seemed strange was that Hu Shih never abandoned Jiang Dongxiu.
Was Jiang Dongxiu Hu Shih’s true love? No. Hu Shih agreed to marry Jiang Dongxiu entirely to please his mother. Thus, the thirteen-year-old boy and the fourteen-year-old girl were engaged. Afterwards, Hu Shih began his thirteen-year journey of study in Shanghai and the United States, while the yet-to-be-wed Jiang Dongxiu moved into the Hu family home to care for his mother.
During his studies in America, Hu Shih met the beautiful and intelligent American woman, Edith Clifford Williams.
[“Although in the past forty-eight hours, I have already written you two letters and a postcard, I can’t resist writing another letter...”]
From excerpts of their letters, one can see the fiery passion in their relationship. During this same period, Hu Shih’s views on women’s education also changed:
The purpose of women’s education is not to prepare them to be good wives and mothers, but to cultivate free and independent women.
This shift in thinking guided his actions. Across the ocean, Jiang Dongxiu received a letter from Hu Shih asking to break off the engagement.
Hu Shih’s formidable mother said bluntly, “This is absolutely impossible, get that idea out of your head!” At the same time, she wrote a letter to Edith’s mother, informing her of the fact that Hu Shih already had a fiancée. In the end, this transnational romance was crushed by the iron will of the two matriarchs.
In 1917, the 27-year-old Hu Shih and the 28-year-old Jiang Dongxiu were finally married. Who could have known that behind this seemingly happy ending was the beginning of a nightmare.
At the wedding, the person who caught Hu Shih’s eye was not his bride, but the bridesmaid, Cao Chengying, who called him “Brother Mi.”
Cao Chengying loved flowers and literature. In her letters to Hu Shih’s wife, “Sister-in-law Dongxiu,” she would enclose poems for “Brother Mi” to appreciate, which Hu Shih was always delighted to do.
The affair between them began during a visit. When Hu Shih was recuperating from an illness in Hangzhou, Cao Chengying, who worked there, went to visit him. The two toured Hangzhou together for four days. As they parted, he gave her a short vernacular poem titled “West Lake.”
The West Lake I dreamed of for seventeen years, Cannot cure my illness, But instead makes it worse ... Now that I’ve returned, I only find it more lovely, And thus cannot bear to leave so soon.
A few days later, Hu Shih secretly traveled from Shanghai to the shores of West Lake in Hangzhou, rented a few rooms in a temple, and lived with Cao Chengying for three months. Afterwards, he went home and formally asked his wife for a divorce, to which Jiang Dongxiu gave her forceful response with a kitchen knife.
From then on, Hu Shih never mentioned divorce again and became the model of a henpecked husband in literary circles. He even wrote the “New Three Obediences and Four Virtues Song”:
[The Three Obediences: One must obey when the wife gives an order; one must follow along when the wife goes shopping or plays mahjong; one must blindly obey when the wife loses her temper for the wrong reason. The Four Virtues: One must be willing to spend when the wife buys things; one must be able to endure when the wife gets angry; one must remember the wife’s birthday; one must be able to wait when the wife is dressing up to go out.]
As for Cao Chengying, she could only grieve in sorrow and had to abort the child she was carrying.
Can a relationship last long based only on threats and intimidation? A kitchen knife can kill, but it can also express tenderness.
Jiang Dongxiu could use a kitchen knife to defend her marriage, but she could also cook Hu Shih’s favorite Anhui dish, “Yipin Guo.” She provided meticulous care in their daily life, an all-powerful homemaker who never let Hu Shih worry about a thing. In her interactions with her husband, she would even speak straightforward words of love:
[“Mr. Gao said you can’t live in Shanghai anymore, he said your health this time is not as good as before. Today I heard him say you are not very well, and my heart felt like it was being cut by a knife. No matter what, I beg you, as soon as you see my letter, to hurry back to Beijing. I have been sick for three days...”]
This was a letter Hu Shih received from his wife while he was ill. Although it was filled with misspelled words, it delighted him, and he even wrote a poem in response:
Sick in bed, I get a letter from my wife, Not even eight full lines of paper; Nothing of importance in it, Yet it makes me quite happy.
In this back-and-forth, sweetness was found between the lines. The village woman with bound feet, Jiang Dongxiu, used both hard and soft tactics to manage the great literary master Hu Shih, securing her own position and growing old together with him.
Some quotes from it:
A few days later, Hu Shih secretly traveled from Shanghai to the shores of West Lake in Hangzhou, rented a few rooms in a temple, and lived with Cao Chengying for three months. Afterwards, he went home and formally asked his wife for a divorce, to which Jiang Dongxiu gave her forceful response with a kitchen knife.
From then on, Hu Shih never mentioned divorce again and became the model of a henpecked husband in literary circles. He even wrote the “New Three Obediences and Four Virtues Song”:
[The Three Obediences: One must obey when the wife gives an order; one must follow along when the wife goes shopping or plays mahjong; one must blindly obey when the wife loses her temper for the wrong reason. The Four Virtues: One must be willing to spend when the wife buys things; one must be able to endure when the wife gets angry; one must remember the wife’s birthday; one must be able to wait when the wife is dressing up to go out.]
Intuitively, I would expect any hard-coded psychological meta-rule that allows a wife to prevent her husband from day trading significant fractions of their wealth based on facts and logic to be a massive net positive to reproductive fitness over the past 3000 years. It clearly didn’t work this time, but that doesn’t mean it was a bad idea over a population.
It is easier to impress people who know you less, because you can choose what to show them, and they don’t see the rest.
For example, a painter can show the 20 best paintings he made during the last 10 years at an exhibition. People are deeply impressed. His wife also knows about the hundreds of paintings he destroyed because he thought they were ugly, and about the months when he didn’t paint anything and just sat depressed at home, drinking alcohol. His wife is much less impressed. She would appreciate more help at home and with the kids; also, the money he brings home is negligible, and he spends most of it on alcohol anyway.
This is a fictional example, but the idea is that the public sees your best, while the people around you see your average.
Who knows, maybe Robin Hanson shares his contrarian ideas with his wife first, she convinces him to abandon the most absurd ones, and he publishes the rest. Maybe after hearing all the crazy ideas we didn’t hear, it made perfect sense to distrust his ideas about selling stocks. -- I am just imagining all this; I have no evidence for that.
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it’s possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading to a collapse of civilization, but embryo selection for IQ could prevent that, and even if collapse happens, natural selection would start increasing fertility and intelligence of humans again, so future smarter humans should be able to rebuild civilization and restart technological progress.
Added: Here’s an example to illustrate my model. Assume a normally distributed population with an average IQ of 100, and suppose we need a certain number of people with IQ>130 to achieve AGI. If the total population were to halve, then to get the same absolute number of IQ>130 people as today, average IQ would have to increase by about 4.5, and if the population were to become 1/10 of the original, average IQ would have to increase by about 18.75.
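For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation, assuming (my assumption) that IQ is normally distributed with mean 100 and SD 15; under that assumption the required shifts come out to roughly 4.7 and 18.8 points, close to the rounded figures above:

```python
# Quick check of the example above. Assumption: IQ ~ Normal(100, 15).
# If the population shrinks to a fraction p of its original size, the fraction of
# people above 130 must grow by 1/p to keep the same absolute number of IQ>130 people.
from scipy.stats import norm

MEAN, SD, THRESHOLD = 100, 15, 130

def required_mean_shift(pop_fraction):
    """Mean IQ increase needed so that a population reduced to `pop_fraction`
    of its original size still has the same number of people above THRESHOLD."""
    base_tail = norm.sf(THRESHOLD, loc=MEAN, scale=SD)  # ~2.3% are above 130 today
    needed_tail = base_tail / pop_fraction              # required fraction above 130
    new_mean = THRESHOLD - SD * norm.isf(needed_tail)   # mean that yields that tail
    return new_mean - MEAN

print(required_mean_shift(0.5))  # ~4.7 points if the population halves
print(required_mean_shift(0.1))  # ~18.8 points if the population drops to 1/10
```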
I find the idea interesting: To achieve a certain value of “total genius”, we either need a large population with a small fraction of geniuses, or a small population with a large fraction of geniuses.
(A third option is a small population with a small fraction of geniuses… and it takes a lot of time. The geniuses read each other’s books, rather than talk to each other directly. I think it was like this in the past. Very inefficient, because the information transfer by reading books is one-sided; does not allow collaboration in real time.)
I wonder how the heritability of IQ works, versus the reversion to the mean. Despite Pol Pot’s dystopian project, the average IQ in Cambodia seems to be… average. What would happen to a country where, let’s say, half of the children are produced by artificial insemination, and half of the sperm comes from fathers with IQ 130 and above? If the mother is average, the child is likely to end up around the average of 100 and 130, so 115. On one hand, nothing exceptional; on the other hand, if the baseline is now slightly higher, then the next generation… and here comes the question of how exactly the reversion to the mean works, and whether the constant injections of IQ-130 genes into the population could outrun it.
Reversion to the mean happens because the phenotype is an imperfect proxy for the genotype, so if you select a person with phenotypic IQ of 130, maybe their genotypic IQ is only 124, and therefore their offspring with an IQ 100 person would only be IQ 112 in expectation.
However, this expectation applies to both the offspring’s genotypic and phenotypic IQ, so this is the only regression to the mean you’re going to see; it’s not going to regress further down the line.
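To make the arithmetic in this exchange explicit, here is a toy additive model; the 0.8 figure is an illustrative assumption of mine, chosen only to reproduce the 130 → 124 example above:

```python
# Toy additive model of regression to the mean (illustrative numbers only).
# Phenotypic IQ = genotypic IQ + noise, so the best estimate of someone's genotypic IQ
# lies part of the way between their measured IQ and the population mean.
POP_MEAN = 100
RELIABILITY = 0.8  # assumed fraction of the phenotypic deviation that is genetic

def expected_genotype(phenotype):
    return POP_MEAN + RELIABILITY * (phenotype - POP_MEAN)

def expected_offspring_iq(parent_a_phenotype, parent_b_phenotype):
    # Under a simple additive model, the offspring's expected genotypic IQ is the
    # midpoint of the parents' expected genotypic IQs, and its expected phenotype
    # equals its expected genotype (the noise averages to zero). The regression
    # happens once, when inferring the parents' genotypes; it does not compound
    # further down the line.
    return (expected_genotype(parent_a_phenotype) +
            expected_genotype(parent_b_phenotype)) / 2

print(expected_genotype(130))           # 124.0
print(expected_offspring_iq(130, 100))  # 112.0, matching the example above
```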
I was curious what Habryka meant when he said this. Don’t non-profits usually have some kind of board oversight? It turns out (from documents filed with the State of California) that Lightcone Infrastructure, which operates LW, is what’s known as a sole-member nonprofit, with a 1-3 person board of directors determined by a single person (member), namely Oliver Habryka. (Edit: My intended meaning here is that this isn’t just a historical fact, but that Habryka still has this unilateral power. And after some debate in the comments, it looks like this is correct after all, but was unintentional. See Habryka’s clarification.)
However, it also looks like the LW domain is owned by MIRI, and MIRI holds the content license (legally the copyright is owned by each contributor and licensed to MIRI for use on LW). So if there was a big enough dispute, MIRI could conceivably find another team to run LW.
I’m not sure who owns the current code for LW, but I would guess it’s Lightcone, so MIRI would have to also recreate a codebase for it (or license GreaterWrong’s, I guess).
I was initially confused why Lightcone was set up that way (i.e., why was LW handed over to an organization controlled by a single person), but the structure probably makes it more nimble and the risk of Lightcone “going rogue” is mitigated to a large extent by MIRI retaining the option to swap out the team.
Anyway it took me a while to figure all this out, and I thought I’d share it so others would be informed while participating on LW.
why was LW handed over to an organization controlled by a single person
The community voted on it back in the day. MIRI wanted to make LessWrong a read-only archive. I said I would try to revive it. At the end of the period high-karma users voted whether to continue under my management, or archive the site. The former happened!
Ah thanks, I remember a bit more now. Looking back at the voting announcement posted by Vaniver, it didn’t mention how important your role would be on LW 2.0:
In case you’re just tuning in now, some basic details: I’ve been posting on LW for a long time, and about two years ago thought I was the person who cared most about making sure LW stayed alive, so decided to put effort into making sure that happened. But while I have some skills as a writer and a programmer, I’m not a webdev and not great at project management, and so things have been rather slow. My current role is mostly in being something like the ‘senior rationalist’ on the team, and supporting the team with my models of what should happen and why. The actual work is being done by a combination of Oliver Habryka, Raymond Arnold, and Ben Pace, and their contributions are why we finally have a site that’s ready to come out of beta.
And I didn’t pay much attention to the LW 2.0 / Lightcone organizational structure in the following years, so it came as kind of a surprise when you said “This is (approximately) my forum.”
Also, just to contextualize this, this is in the context of a thread about forum moderation. I have various complicated takes about the degree to which LW belongs to Lightcone Infrastructure, and what our relationships to various stakeholders are, and I don’t relate to LessWrong as a thing I (approximately) own in most respects.
If you are uncertain about what I would feel comfortable doing, and what I wouldn’t, feel free to ask me!
what’s known as a sole-member nonprofit, with a 1-3 person board of directors determined by a single person (member), namely Oliver Habryka.
This is true, but to be clear, we do have a 3 person board! It’s me, Vaniver and Daniel Kokotajlo! I appointed Vaniver and Daniel, but we do now have a pretty bog-standard 3 person board, so things are not solely controlled by me.
According to the bylaws I linked, you (as the sole member of Lightcone) have “the exclusive right to remove a director, with or without cause”. Since the bylaws also allow Lightcone to have as few as 1 director, my understanding is that at any time, you could choose to invoke the option of removing the other directors and become the sole director. (I’m not familiar with the nonprofit world, and don’t know how common or standard this is, but it seems fair to describe this as an organization controlled by one individual.)
No, you are misreading the bylaws. The bylaws explicitly state:
(1) the directors of this corporation shall serve as the Members of this corporation, (2) election to the Board shall constitute election to the membership of this corporation; and (3) the membership status of any director shall terminate immediately and without further corporate action upon such person’s ceasing to be a director for any reason,
I don’t really understand why corporate bylaws are structured this way, but at the present moment in time, Lightcone Infrastructure has 3 members, which together can vote on removing members and appointing members (and every director is automatically a member, and loses their membership status the moment they stop being a director).
It looks like I agreed with you too quickly. Just double-checked with Gemini Pro 3.0, and its answer looks correct to me:
This is a fascinating turn of events. Oliver is quoting from Section 3.01 of the bylaws, but he appears to be missing the critical conditional clause that precedes the text he quoted.
If you look at the bottom of Page 11 leading into Page 12 of the PDF, the sentence structure reveals that the “Directors = Members” rule is a fail-safe mechanism that only triggers if the initial member (Oliver) dies or becomes incapacitated without naming a successor.
Here is the text from the document:
[Page 11, bottom] …Upon the death, resignation, or incapacity of all successor Members where no successor [Page 12, top] Member is named, (1) the directors of this corporation shall serve as the Members of this corporation...
By omitting the “Upon the death, resignation, or incapacity...” part, he is interpreting the emergency succession plan as the current operating rule.
Oh, huh, maybe you are right? If so, I myself was unaware of this! I will double check our bylaws and elections that have happened so far and confirm the current state of things. I was definitely acting under the assumption that I wasn’t able to fire Vaniver and Daniel and that they would be able to fire me!
See for example this guidance document I sent to Daniel and Vaniver when I asked them to be board members:
If it is indeed true that they cannot fire me, then I should really rectify that! If so, I am genuinely very grateful for you noticing.
I think, given the clear statements I have made that I am appointing them to a position in which they are able to fire me, they would probably indeed have held the formal power to do so, but it is possible that we didn’t follow the right corporate formalities, and if so we should fix that! Corporate formalities do often turn out to really matter in the end.
From what I (and Gemini) can tell, your screenshot said nothing technically untrue. Technically they can fire you as the CEO, but you’d still be the sole member and could fire them and then hire yourself back. :)
The simplest way to rectify the situation to match your intent is to either (1) resign as member without naming a successor, in which case the passage you quoted would come into effect and make the directors into members, or (2) name the 3 current directors as Successor Members and then resign as member. You’ll probably want to consult a lawyer or advisor for the pros and cons of each option.
Would be grateful for an update once you’ve done this, or perhaps verified that the situation is actually different (e.g. you already resigned as member but forgot).
I think I should just be able to appoint the two as members without resigning as member? Like, members can vote to modify the bylaws, so at the very least I should just be able to appoint a new member by rewriting the bylaw.
I will look into this sometime this week or next week. Feel free to poke me for an update any time if I end up not putting one here.
Ah, I had indeed missed that part. A couple of AIs I asked also missed it, and together with the quoted statement from you, made me pretty sure my interpretation was correct. Sorry, and thanks for the quick correction. I’ve edited my post, and hope it didn’t mislead too many people.
Some potential risks stemming from trying to increase philosophical competence of humans and AIs, or doing metaphilosophy research. (1 and 2 seem almost too obvious to write down, but I think I should probably write them down anyway.)
Philosophical competence is dual use, like much else in AI safety. It may for example allow a misaligned AI to make better decisions (by developing a better decision theory), and thereby take more power in this universe or cause greater harm in the multiverse.
Some researchers/proponents may be overconfident, and cause flawed metaphilosophical solutions to be deployed or spread, which in turn derail our civilization’s overall philosophical progress.
Increased philosophical competence may cause many humans and AIs to realize that various socially useful beliefs have weak philosophical justifications (such as the beliefs that all humans are created equal, have equal moral worth, or have natural inalienable rights; moral codes based on theism; etc.). In many cases the only justifiable philosophical positions in the short to medium run may be states of high uncertainty and confusion, and it seems unpredictable what effects will come from many people adopting such positions.
Maybe the nature of philosophy is very different from my current guesses, such that greater philosophical competence or orientation is harmful even in aligned humans/AIs and even in the long run. For example maybe philosophical reflection, even if done right, causes a kind of value drift, and by the time you’ve clearly figured that out, it’s too late because you’ve become a different person with different values.
This is pretty related to 2-4, especially 3 and 4, but also: you can induce ontological crises in yourself, and this can be pretty fraught. Two subclasses:
You now think of the world in a fundamentally different way. Example: before, you thought of “one real world”; now you think in terms of Everett branches, mathematical multiverse, counterlogicals, simulation, reality fluid, attention juice, etc. Example: before, a conscious being is a flesh-and-blood human; now it is a computational pattern. Example: before you took for granted a background moral perspective; now, you see that everything that produces your sense of values and morals is some algorithms, put there by evolution and training. This can disconnect previously-functional flows from values through beliefs to actions. E.g. now you think it’s fine to suppress / disengage some moral intuition / worry you have, because it’s just some neurological tic. Or, now that you think of morality as “what successfully exists”, you think it’s fine to harm other people for your own advantage. Or, now that you’ve noticed that some things you thought were deep-seated, truthful beliefs were actually just status-seeking simulacra, you now treat everything as status-seeking simulacra. Or something, idk.
Thanks, I updated down a bit on risks from increasing philosophical competence based on this (as all of these seem very weak).
(Relevant to some stuff I’m doing as I’m writing about work in this area.)
IMO, the biggest risk isn’t on your list: increased salience and reasoning about infohazards in general and in particular certain aspects of acausal interactions. Of course, we need to reason about how to handle these risks eventually but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence so we quickly move through the regime where AIs aren’t smart enough to be careful, but are smart enough to discover info hazards.
I think the most dangerous version of 3 is a sort of Chesterton’s fence, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (Decline in high-g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn’t require holding on to the mistaken belief.
Do you have an example for 4? It seems rather abstract and contrived.
Generally, I think the value of believing true things tends to be almost always positive. Examples to the contrary seem mostly contrived (basilisk-like infohazards) or only occur relatively rarely. (E.g. believing a lie makes you more convincing, as you don’t technically have to lie when telling the falsehood, but lying is mostly bad or not very good anyway.)
Overall, I think the risks from philosophical progress aren’t overly serious while the opportunities are quite large, so the overall EV looks comfortably positive.
I think the most dangerous version of 3 is a sort of Chesterton’s fence, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (Decline in high-g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn’t require holding on to the mistaken belief.
I think that makes sense, but sometimes you can’t necessarily motivate a useful norm “by recognizing that the norm is useful” to the same degree that you can with a false belief. For example there may be situations where someone has an opportunity to violate a social norm in an unobservable way, and they could be more motivated by the idea of potential punishment from God if they were to violate it, vs just following the norm for the greater (social) good.
Do you have an example for 4? It seems rather abstract and contrived.
Hard not to sound abstract and contrived here, but to say a bit more, maybe there is no such thing as philosophical progress (outside of some narrow domains), so by doing philosophical reflection you’re essentially just taking a random walk through idea space. Or philosophy is a memetic parasite that exploits bug(s) in human minds to spread itself, perhaps similar to (some) religions.
Overall, I think the risks from philosophical progress aren’t overly serious while the opportunities are quite large, so the overall EV looks comfortably positive.
I think the EV is positive if done carefully, which I think I had previously been assuming, but I’m a bit worried now that most people I can attract to the field might not be as careful as I had assumed, so I’ve become less certain about this.
I would expect higher competence in philosophy to reduce overconfidence, not increase it? The more you learn, the more you realize how much you don’t know.
Today I was author-banned for the first time, without warning and as a total surprise to me, ~8 years after banning power was given to authors, but less than 3 months since @Said Achmiz was removed from LW. It seems to vindicate my fear that LW would slide towards a more censorious culture if the mods went through with their decision.
Has anyone noticed any positive effects, BTW? Has anyone who stayed away from LW because of Said rejoined?
Edit: In addition to the timing, there’s the fact that I do not recall previously seeing a ban based on just one interaction/thread, instead of some long-term pattern of behavior. Also, I’m not linking the thread because IIUC the mods do not wish to see authors criticized for exercising their mod powers, and I also don’t want to criticize the specific author. I’m worried about the overall cultural trend caused by admin policies/preferences, not trying to apply pressure to the author who banned me.
[Reposting my previous comment without linking to the specific thread in question:]
I don’t understand the implied connection to “censorious” or “culture”. You had a prolonged comment thread/discussion/dispute (I didn’t read it) with one individual author, and they got annoyed at some point and essentially blocked you. Setting aside both the tone and the veracity of their justifying statements (<quotes removed>), disengaging from unpleasant interactions with other users is normal and pretty unobjectionable, right?
(Thanks for reposting without the link/quotes. I added back the karma your comment had, as best as I could.)
Previously, the normal way to disengage was to just disengage, or to say that one is disengaging and then stop responding, not to suddenly ban someone without warning based on one thread. I do not recall seeing a ban previously that wasn’t based on some long term pattern of behavior.
This is complicated. I was also banned by someone; I even wrote a bug report (“hey, I can’t reply to this person, pls fix the bug”) and was told that I was actually banned by that person… it kinda hurt for a moment, but then I shrugged and mostly forgot about it.
I think there are both possible good and bad consequences.
It will be bad if some people start systematically banning e.g. those who call out their bullshit. Because that is a large part of the value of this website: people giving reasonable feedback. If the article is bullshit, someone will likely mention it in the comments. If the comment is bullshit, and it is visible enough, someone will likely reply. This mechanism could be broken if some people start banning those who provide this feedback.
On the other hand, some people really are super annoying. But should they be banned from the website? Seems too harsh. But if they become obsessed with some selected targets, those targets may decide to leave, which is also bad. Giving the target the possibility to ban the annoying person seems like a win/win solution.
...no conclusion here.
Maybe allow every user to only ban three people? (If you ban a fourth one, the first one is automatically unbanned.) This is based on the assumption that really annoying people are rare and don’t have multiple accounts. So you can deflect the one that annoys you most, without being able to build an echo chamber.
Maybe figure out another way to signal “I consider your behavior annoying” without a ban?
Maybe reduce the meaning of the ban from “you can’t reply to this person at all” to “you can only post one reply per article”? So you can state your objections, but you can’t stay there and keep interacting with the author. When you are writing the reply, you are notified that this is the only one you get under this article.
Maybe reduce the meaning of the ban from “you can’t reply to this person at all” to “you can only post one reply per article”? So you can state your objections, but you can’t stay there and keep interacting with the author. When you are writing the reply, you are notified that this is the only one you get under this article.
could be good, if combined with up to 3 total bans
I think the cultural slide will include self-censorship, e.g., having had this experience (of being banned out of the blue), in the future I’ll probably subconsciously be constantly thinking “am I annoying this author too much with my comments” and disengage early or change what I say before I get banned, and this will largely be out of my conscious control.
(I don’t want to start a fight and hopefully I’ll write a post explaining the behavior I’m talking about, but I’ll say abstractly, my hope in general is for people (me, you, anyone) to try as much as feasible to make fairly precise updates, like “this specific behavior pattern is bad / unhelpful / unwelcome in this context” rather than “I should be vaguely more worried about being vaguely annoying”.)
I think when a human gets a negative reward signal, probably all the circuits that contributed to the “episode trajectory” get downweighted, and antagonistic circuits get upweighted, similar to an AI being trained with RL. I can override my subconscious circuits with conscious willpower, but I only have so much conscious processing and willpower to go around. For example, I’m currently feeling a pretty large aversion towards talking with you, but am overriding it because I think it’s worth the effort to get this message out; I can’t keep the “override” active forever, though.
Of course I can consciously learn more precise things, if you were to write about them, but that seems unlikely to change the subconscious learning that happened already.
I wrote Smart Losers a long time ago, trying to understand/explain certain human phenomena. But the model could potentially be useful for understanding (certain aspects of) human-AI interactions as well.
Possibly relevant anecdote: Once I was with a group of people who tried various psychological experiments. That day, the organizers proposed that we play iterated Prisoner’s Dilemma. I was like “yay, I know the winning strategy, this will be so easy!”
I lost. Almost everyone always defected against me; there wasn’t much I could do to get points comparable to other people who mostly cooperated with each other.
After the game, I asked why. (During the game, we were not allowed to communicate, just to write our moves.) The typical answer was something like: “well, you are obviously very smart, so no matter what I do, you will certainly find a way to win against me, so my best option is to play it safe and always defect, to avoid the worst outcome”.
I am not even sure if I should be angry at them. I suppose that in real life, when you have about average intelligence, “don’t trust people visibly smarter than you” is probably a good strategy, on average, because there are just too many clever scammers walking around. At the same time I feel hurt, because I am a natural altruist and cooperator, so this feels extremely unfair, and a loss for both sides.
(There were other situations in my life where the same pattern probably also applied, but most of the time, you just don’t know why other people do whatever they do. This time I was told their reasoning explicitly.)
What (human or not) phenomena do you think are well explained by this model? I tried to think of any for 5 minutes and the best I came up with was the strong egalitarianism among hunter gatherers. I don’t actually know that much about hunter gatherers though. In the modern world something where “high IQ” people are doing worse is sex, but it doesn’t seem to fit your model.
Human-human: Various historical and current episodes of smarter-than-average populations being persecuted or discriminated against, such as intellectuals, “capitalists” (i.e., people labeled as such), certain ethnic groups. (I’m unsure my model is actually a good explanation of such phenomena, but this is mainly what I was trying to explain.)
Human-AI: Many people being reluctant to believe that it’s a good idea to build unaligned artificial superintelligences and then constrain them with a system of laws and/or social norms (which some people like Robin Hanson and Matthew Barnett have proposed). Aside from the issue of violent overthrow, any such system is bound to have loopholes, which the ASI will be more adept at exploiting, yet this adeptness potentially causes the ASI to be worse off (less likely to exist in the first place), similar to what happens in my model.
I want to highlight a point I made in an EAF thread with Will MacAskill, which seems novel or at least underappreciated. For context, we’re discussing whether the risk vs time (in AI pause/slowdown) curve is concave or convex, or in other words, whether the marginal value of an AI pause increases or decreases with pause length. Here’s the whole comment for context, with the specific passage bolded:
Whereas it seems like maybe you think it’s convex, such that smaller pauses or slowdowns do very little?
I think my point in the opening comment does not logically depend on whether the risk vs time (in pause/slowdown) curve is convex or concave[1], but it may be a major difference in how we’re thinking about the situation, so thanks for surfacing this. In particular I see 3 large sources of convexity:
The disjunctive nature of risk / conjunctive nature of success. If there are N problems that all have to be solved correctly to get a near-optimal future, without losing most of the potential value of the universe, then that can make the overall risk curve convex or at least less concave. For example, compare f(x) = 1 − 1/2^(1 + x/10) and f^4 (see the sketch after this list).
Human intelligence enhancements coming online during the pause/slowdown, with each maturing cohort potentially giving a large speed boost for solving these problems.
Rationality/coordination threshold effect, where if humanity makes enough intellectual or other progress to subsequently make an optimal or near-optimal policy decision about AI (e.g., realize that we should pause AI development until overall AI risk is at some acceptable level, or something like this but perhaps more complex involving various tradeoffs), then that last bit of effort or time to get to this point has a huge amount of marginal value.
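To make the f vs. f^4 comparison in the first item concrete, here is a minimal Python sketch (my own illustration; it reads f^4 as f(x) raised to the 4th power, i.e., the chance that four such independent problems all get solved):

```python
# f(x): chance of solving one problem given an x-year pause (concave in x).
# f(x)**4: chance of solving four independent problems. Its marginal gain per
# extra decade initially grows (the second decade is worth more than the first),
# illustrating how conjunctive success makes the curve less concave / more convex.
def f(x):
    return 1 - 1 / 2 ** (1 + x / 10)

for x in range(0, 60, 10):
    print(f"pause={x:2d}y  f={f(x):.3f}  f^4={f(x) ** 4:.3f}")
```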
Like: putting in the schlep to RL AI and create scaffolds so that we can have AI making progress on these problems months earlier than we would have done otherwise
I think this kind of approach can backfire badly (especially given human overconfidence), because we currently don’t know how to judge progress on these problems except by using human judgment, and it may be easier for AIs to game human judgment than to make real progress. (Researchers trying to use LLMs as RL judges apparently run into the analogous problem constantly.)
having governance set up such that the most important decision-makers are actually concerned about these issues and listening to the AI-results that are being produced
What if the leaders can’t or shouldn’t trust the AI results?
I’m trying to coordinate with, or avoid interfering with, people who are trying to implement an AI pause or create conditions conducive to a future pause. As mentioned in the grandparent comment, one way people like us could interfere with such efforts is by feeding into a human tendency to be overconfident about one’s own ideas/solutions/approaches.
What I’ve been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:
Writing articles in Chinese for my family members, explaining things like cognitive bias, evolutionary psychology, and why dialectical materialism is wrong. (My own Chinese writing ability is <4th grade.) My workflow is to have a chat about some topic with the AI in English, then have it write an article in Chinese based on the chat, then edit or have it edit as needed.
Simple coding/scripting projects. (I don’t code seriously anymore.)
Discussing history, motivations of actors, impact of ideology and culture, what if, etc.
Searching/collating information.
Reviewing my LW posts/comments (any clear flaws, any objections I should pre-empt, how others might respond)
Explaining parts of other people’s comments when the meaning or logic isn’t clear to me.
Expanding parts of my argument (and putting this in a collapsible section) when I suspect my own writing might be too terse or hard to understand.
Sometimes just having a sympathetic voice to hear my lamentations of humanity’s probable fate.
I started using AI more after Grok 3 came out (I have an annual X subscription for Tweeting purposes), as previous free chatbots didn’t seem capable enough for many of these purposes, and then switched to Gemini 2.0 Pro which was force upgraded to 2.5 Pro. Curious what other people are using AI for these days.
generate simple Python code, mostly to work with files and images
ask for examples of how to do something in certain Java libraries
translate a book from Russian to Slovak and English, including puns and poems
I tried to also use Claude to explain to me some parts of set theory, but it hallucinates so much that it is unusable for this purpose. Practically every mathematical argument contains an error somewhere in the middle. Asking the same question in two chats will give me “yes—here is the proof” in one, and “no—here is a counterexample” in another; and that’s after I’ve already turned on the extra careful mathematical reasoning.
My wife tried to use Claude for biochemical research, but again, too many hallucinations to be useful. Anything you ask, “yes, this is correct, you are so smart, let me give you a few scientific references for that...” (all made up).
Writing articles in Chinese for my family members, explaining things like cognitive bias, evolutionary psychology, and why dialectical materialism is wrong.
Your needing to write them seems to suggest that there’s not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I’m also curious about how your family received these articles.
Your needing to write them seems to suggest that there’s not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I’m not sure how much such content exists in Chinese, because I didn’t look. It seems easier to just write new content using AI; that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also, reading Chinese is kind of a chore for me, and I don’t want to wade through a list of search results trying to find what I need.
I thought about publishing them somewhere, but so far haven’t:
concerns about publishing AI content (potentially contributing to “slop”)
not active in any Chinese forums, not familiar with any Chinese publishing platforms
probably won’t find any audience (too much low quality content on the web, how will people find my posts)
don’t feel motivated to engage/dialogue with a random audience, if they comment or ask questions
Sure—I am currently on my phone, but I can paint a quick picture.
Local Memory—I keep my own internal predictions on Fatebook and have them synced locally to my Obsidian (a local markdown file manager). Then I use Claude’s Obsidian MCP to help me write down my daily notes from work, along with a jumbled context of my messages with coworkers, random web comments, and other messaging services, so it can help me keep my profiles on my friends and projects up to date. (It is, again, glued together with more MCPs that have limited access to my chat logs with my friends.) Of course, with a human in the loop.
Delphi—I wrote a simple MCP that basically just does the Delphi method with LLMs. Usually facilitated by Claude, it calls a panel of experts; these experts are the top-K ranked models on LLM arena. It generates a questionnaire based on my question, hands it out, aggregates the answers, and decides whether a consensus is reached! Again, it gets the context it needs from me through my Obsidian. I use this for questions that are more personal or that don’t have good liquidity on prediction markets.
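For what it’s worth, a rough Python sketch of the Delphi loop described above might look like the following; the query_model helper, the panel names, and the fixed round count are all assumptions standing in for the actual MCP plumbing and facilitator logic:

```python
# Rough sketch of a Delphi-style loop over a panel of LLMs (assumptions, not the real MCP).
PANEL = ["model-a", "model-b", "model-c"]  # placeholder names for the top-ranked arena models

def query_model(model: str, prompt: str) -> str:
    """Stand-in for the real MCP/API call to one panel model."""
    raise NotImplementedError

def delphi(question: str, context: str, rounds: int = 3) -> list[str]:
    answers: list[str] = []
    for _ in range(rounds):
        summary = "\n".join(answers)  # anonymized feedback from the prior round
        prompt = f"{context}\n\nQuestion: {question}\n\nPrior panel answers:\n{summary}"
        answers = [query_model(m, prompt) for m in PANEL]
        # In the setup described above, a facilitator model judges whether a
        # consensus has been reached; this sketch just runs a fixed number of rounds.
    return answers
```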
I’m still using it for this purpose, but don’t have a good sense of how much worse it is compared to pre-0325. However I’m definitely very wary of the sycophancy and overall bad judgment. I’m only using them to point out potential issues I may have overlooked, and not e.g. whether a draft is ready to post, or whether some potential issue is a real issue that needs to be fixed. All the models I’ve tried seem to err a lot in both directions.
In my posts, I’ve argued that indexical uncertainty like this shouldn’t be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly as if that was the only copy. [Emphasis added]
In the subsequent 15 years, I’ve upweighted influencing the multiverse through my copies in simulations, relative to base universes, to where they’re about equally important in my mind. For 4 reasons:
Realizing that there are probably universes with vastly greater computational resources than ours, implying there are more simulations containing me than I had thought.
Realizing that there may well be many possible motivations for running simulations, besides interest in ancestors, which was the main motivation talked about back in 2010. This includes acausal trade, curiosity about aliens, science on the distribution of superintelligences in the multiverse, gathering philosophical ideas, entertainment, altruism.
Being fortunate in various ways, on a personal level, that seem more likely if I’m in a simulation.
Doing badly on a civilizational level, in a way that devalues trying to influence base reality. I.e., relative to 2010, it looks more likely we’ll undergo the AI transition blindly, without key decision makers paying attention to important philosophical and strategic ideas.
Concretely, this mostly cashes out to me thinking and writing with potential simulators in mind as an additional audience, hoping my ideas might benefit or interest some of them even if they end up largely ignored in this reality.
Realizing that there are probably universes with vastly greater computational resources than ours, implying there are more simulations containing me than I had thought.
What made you believe that?
I find it hard to even conceptualize how to think through something like that, including the anthropics, which computationally powerful universes to admit, &c.
My intuition is that allowing universes with hypercomputation puts us in a dovetailer being run almost surely somewhere in the most computationally powerful universes, but that this all introduces a ton of difficulties into reasoning about the multiverse and our position inside of it.
This was written with AI assistance, and my workflow was to give the general theme to AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and do small manual changes. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.
Thanks! I was unsure whether to post this, because I’m worried about contributing to a future where AI generated content overwhelms everything, and I wasn’t sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)
Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I’ll just describe what I did in more detail. Started with
Can you write an outline for an episode of STTNG in which the crew becomes suspicious of various logical inconsistencies in the universe (including backups being taboo) and starts questioning the nature of their reality?
which resulted in this initial outline:
I. Teaser
The Enterprise responds to a distress call from a Federation outpost
Upon arrival, they find the outpost abandoned, with only a mysterious device left behind
II. Act One
Data activates the device, which emits a pulse of energy
Crew members begin noticing small inconsistencies in their daily routines
Geordi questions why they don’t use replicators to solve resource shortages on colonies
III. Act Two
More crew members report strange observations
Dr. Crusher wonders why they don’t use transporter technology to cure diseases or injuries
Worf ponders the improbability of so many humanoid species evolving independently
Picard becomes concerned about the crew’s unusual behavior
IV. Act Three
Riker raises the question of why they don’t use transporters for backing up crew members
The senior staff meets to discuss the mounting inconsistencies
Data theorizes that the mysterious device has enhanced their critical thinking abilities
Picard orders the ship to return to the outpost for further investigation
V. Act Four
At the outpost, they discover records of experiments in “cognitive enhancement”
The crew starts questioning the nature of their reality
Some crew members experience existential crises
The ship’s computer begins to malfunction as the crew’s doubts intensify
VI. Act Five
Data determines that their enhanced cognition is causing a feedback loop in the ship’s systems
The crew must find a way to deactivate the device and return to normal
Philosophical debate ensues about whether they should retain their newfound awareness
Picard makes the difficult decision to deactivate the device
VII. Epilogue
The crew returns to normal, but with a lingering sense of unease
Picard records a captain’s log reflecting on the experience and the nature of their reality
Final scene shows a subtle glitch in the ship’s operations, leaving a sense of ambiguity
Then the main things I recall doing:
Ask AI to add parts about audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
Add second part of conversation between Guinan and Ensign, with both of them having second thoughts about what Guinan initially said.
Add part about suspending ship computer to sandbox, to make clear it wasn’t harmed.
Ask AI for additional realizations/inferences the crew might have, and add selected ones to the story.
Ask AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
Pointing out a bunch of sentences/paragraphs that didn’t make as much sense as they could, with AI generally able to improve them a lot in one try.
Also tried to get AI to find issues with the story on its own, but that didn’t work very well.
Manually make some edits, like changing a word here and there, deleting parts that were repetitive, moving parts around to make the story flow better.
So, it wasn’t your idea for the crew to start theorizing that they were fictional characters? Nor was it your idea for them to theorize that they were fictional characters in a story written by AI?
Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:
Please read this outline and suggest some ways to improve the script, especially to make it more philosophically and intellectually interesting.
And as part of its answer it suggested:
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Then I wrote
What about exploring their previous lack of critical thinking and implications of that both for their own universe and the larger universe they’re part of?
Which made it focus more on the fiction hypothesis.
The part about the story being written by AI was its own idea. No hints from me as far as I can remember.
ETA: Here’s the whole answer to my “make it more philosophically and intellectually interesting” prompt:
This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:
Expand on the nature of reality:
Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Explore the implications of living in a simulated reality and how it affects free will and personal identity.
Deepen the existential crisis:
Have characters grapple with the meaning of their existence if their reality is artificial.
Explore how this realization affects their relationships, duties, and sense of purpose.
Include scenes where characters contemplate whether their memories and experiences are “real.”
Introduce a moral dilemma:
If they discover their reality is simulated, should they attempt to contact their “creators”?
Explore the ethics of continuing to exist in a simulated world vs. attempting to “wake up” to a potentially harsher reality.
Have characters debate the morality of creating simulated universes themselves.
Expand on the scientific implications:
Have Data and Geordi theorize about the computational power required to simulate their universe.
Explore how the laws of physics in their universe might be “programmed” and if they can be altered.
Discuss the anthropic principle and fine-tuning of universal constants.
Incorporate elements of epistemology:
Have characters question how they can know anything for certain given their potentially simulated existence.
Explore concepts like Descartes’ “I think, therefore I am” in the context of artificial consciousness.
Discuss the nature of consciousness and whether it can be truly simulated.
Add a temporal element:
Introduce time dilation or time loops as the crew’s enhanced cognition allows them to perceive time differently.
Explore how their perception of time affects their understanding of causality and free will.
Enhance the ending:
Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
Consider a more ambiguous ending that leaves the question of their reality’s nature unresolved.
Include references to philosophical works:
Have characters discuss relevant philosophical concepts, such as Plato’s Allegory of the Cave, Berkeley’s idealism, or Putnam’s brain in a vat thought experiment.
Include a scene where the crew debates different philosophical perspectives on the nature of reality.
Explore the ethical implications of their technology:
Expand on Riker’s question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
Delve deeper into the ethics of using replicators and how it affects societal structures and values.
Incorporate a meta-narrative element:
Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
Explore the idea of narrative causality and how it might shape their reality.
These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.
A clear mistake of early AI safety people is not emphasizing enough (or ignoring) the possibility that solving AI alignment (as a set of technical/philosophical problems) may not be feasible in the relevant time-frame, without a long AI pause. Some have subsequently changed their minds about pausing AI, but by not reflecting on and publicly acknowledging their initial mistakes, I think they are or will be partly responsible for others repeating similar mistakes.
Case in point is Will MacAskill’s recent Effective altruism in the age of AGI. Here’s my reply, copied from EA Forum:
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which judging from other comparable fields, like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that by creating (or redirecting) a movement to solve these problems, without noting at an early stage that these problems may not be solvable in a relevant time-frame (without a long AI pause), it will feed into a human tendency to be overconfident about one’s own ideas and solutions, and create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
We can also ask whether it is right to conceive of e.g. [alignment, metaphilosophy, AI welfare, concentration of power] as things that could be “solved” at all, or if these are instead more like rich areas that will basically need to be worked on indefinitely as history continues.
There are sometimes deadlines, such that we could get unacceptable outcomes by failing to make a particular sort of progress by the time a particular state arrives. Both referring to these fields as possibly needing to be fully solved, and referring to them as not containing things that might need to be solved by a deadline, are quite misleading.
Yea I agree it totally makes sense and is important to ask whether we understand things well enough for it to be fine to (let anyone) do some particular thing, for various particular things here.[1] And my previous comment is indeed potentially misleading given that I didn’t clarify this (though I do clarify this in the linked post).
Indeed, I think we should presently ban AGI for at least a very long time; I think it’s plausible that there is no time t such that it is fine at time t to make an AI that is (1) more capable than humans/humanity at time t and (2) not just a continuation of a human (like, a mind upload) or humanity or sth like that; and I think fooming should probably be carefully regulated forever. I think humans/humanity should be carefully growing ever more capable, with no non-human AIs above humans/humanity plausibly ever.
Even earlier, there was an idea that one has to rush to create a friendly AI and use it to take over the world, to prevent other, misaligned AIs from appearing. The problem is that this idea is likely still in the minds of some AI company leaders, and fuels the AI race.
Another (arguably similar) unintended consequence of underemphasizing the difficulty of AI alignment was that it led some to believe that if we don’t rush to build an ASI, we’ll be left defenseless against other X-risks, which would be a perfectly rational thought if alignment were easier.
I think it is also worth considering the possibility that these risks aren’t the sort of thing which can be reduced to an acceptable level with a decade-scale AI pause either. Particularly the ones which people have been trying to solve for centuries already (e.g. principal-agent problem).
Does that mean that you think that boring old yes-takeover AI risk can be solved without a pause? Or even with a pause? That seems very optimistic indeed.
I don’t think you’re going to get that consensus regardless of what kind of copium people have invested in. Not only that, but even if you had consensus I don’t think it would let you actually enact anything remotely resembling a “long enough” pause. Maybe a tiny “speed bump”, but nothing plausibly long enough to help with either the takeover or non-takeover risks. It’s not certain that you could solve all of those problems with a pause of any length, but it’s wildly unlikely, to the point of not being worth fretting about, that you can solve them with a pause of achievable length.
… which means I think “we” (not me, actually...) are going to end up just going for it, without anything you could really call a “solution” to anything, whether it’s wise or not. Probably one or more of the bad scenarios will actually happen. We may get lucky enough not to end up with extinction, but only by dumb luck, not because anybody solved anything. Especially not because a pause enabled anybody to solve anything, because there will be no pause of significant length. Literally nobody, and no combination of people, is going to be able to change that, by any means whatsoever, regardless of how good an idea it might be. Might as well admit the truth.
I mean, I’m not gonna stand in your way if you want to try for a pause, and if it’s convenient I’ll even help you tell people they’re dumb for just charging ahead, but I do not expect any actual success (and am not going to dump a huge amount of energy into the lost cause).
By the way, if you want to talk about “early”, I, for one, have held the view that usefully long pauses aren’t feasible, for basically the same reasons, since the early 1990s. The only change for me has been to get less optimistic about solutions being possible with or without even an extremely, infeasibly long pause. I believe plenty of other people have had roughly the same opinion during all that time.
It’s not about some “early refusal” to accept that the problems can’t be solved without a pause. It’s about a still continuing belief that a “long enough pause”, however convenient, isn’t plausibly going to actually happen… and/or that the problems can be solved even with a pause.
We should also consider the possibility that we can’t safely build a superintelligence and remain in control. What if “alignment” means, “We think we can build a superintelligence that’s a slightly better pet owner for the human race, but we can’t predict how it will evolve as it learns”? What if there’s nothing better on offer?
I cannot rule this out as a major possibility, for all the reasons pointed out in IABIED. I think it’s a possibility worth serious consideration when planning.
Does that mean that you think it’s more likely you can safely build a superintelligence and not remain in control?
What load is “and remain in control” carrying?
On edit: By the way, I actually do believe both that “control” is an extra design constraint that could push the problem over into impossibility, and that “control” is an actively bad goal that’s dangerous in itself. But it didn’t sound to me like you thought any scenario involving losing control could be called “safe”, so I’m trying to tease out why you included the qualifier.
Thank you! Let me clarify my phrasing.
When I speak of losing control, I don’t just mean losing control over the AI. I also mean losing any real control over our future. The future of the human race may be decided at a meeting that we do not organize, that we do not control, and that we do not necessarily get to speak at.
I do, however, agree that futures where someone remains in control of the superintelligence also look worrisome to me, because we haven’t solved alignment of powerful humans in any lasting way despite 10,000 years of trying.
Interesting to hear (1) from you. My impression was that you pretty much have the whole answer to that problem, or at least the pieces. UDASSA closely resembles it.
It is: Just provide a naturalish encoding scheme for experience, and one for physical ontology, and measure the inverse K of the mappings from ontologies to experiences, and that gives you the extent to which a particular experience is had by a particular substrate/universe.
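If it helps, here is a rough formalization of that proposal as I read it (the notation is my own assumption, in the spirit of UDASSA, not a quote from anyone here):

```latex
% Rough formalization (my reading): the degree to which experience e is
% instantiated by substrate/universe u is weighted by the length of the
% shortest program mapping a description of u to a description of e.
\[
  m(e \mid u) \;\propto\; 2^{-K(e \mid u)},
  \qquad
  K(e \mid u) \;=\; \min\{\, |p| : U(p, u) = e \,\}
\]
```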
The hard problem is mysterious, but in a trivial way: there are limits to what can ever be known about it, but those limits are also clear. We’re never getting more observations, because it concerns something that’s inherently unobservable or entirely prior to observation.
I think I’ve also heard definitions of the hard problem along the lines of “understanding why people think there’s a hard problem”, though, which I do find formidable.
How do you come up with an encoding that covers all possible experiences? How do you determine which experiences have positive and negative values (and their amplitudes)? What to do about the degrees of freedom in choosing the Turing machine and encoding schemes, which can be handwaved away in some applications of AIT but not here I think?
Some variation of accepting the inevitability of error and dealing with it.
Which could involve surveying all of the options in wolfram-like settings where we’re studying how physics-like rules arise on different levels of abstraction, and seeing how much they really seem to differ in nature. It might turn out that there are more or less natural turing languages, that the typical natural universal turing machine is more like lambda calculus, or more like graph rewriting, or some new thing we hadn’t considered.
Negative values? Why would we need negative values?
I contend that all experiences have a trace presence in all places (in expectation; of course, we will never have any data on whether they actually do, whether they’re quantised, or whatever. Only a very small subset of experiences give us verbal reports). One of the many bitter pills. We can’t rule out the presence of an experience (nor of experiences physically overlapping with each other), so we have to accept them all.
Yeah, this might be one of those situations that’s affected a lot by the fact that there’s no way to detect indexical measure, so any arbitrary wrongness about our UD won’t be corrected with data, but I’m not sure. As soon as we start actually doing Solomonoff induction in any context, we might find that it makes pretty useful recommendations and this won’t seem like so much of a problem.
Also, even if the UD is wrong and unfixable, that doesn’t mean there’s a better choice. We pretty much know that there isn’t.
By negative value I mean negative utility, or an experience that’s worse than a neutral or null experience.
That fully boils down to whether the experience includes a preference to be dead (or to have not been born).
And, btw, that doesn’t correspond to the sign of the agent’s utility function. The sign is meaningless in utility functions (you can add or subtract a constant to an agent’s utility function so that all points go from being negative to being positive, and the agent’s behaviour and decisions won’t change in any way as a result, for any constant). You’re referring to welfare functions, which I don’t think are a useful concept. Hedonic utilitarians sometimes call them utility functions, but we shouldn’t conflate those here.
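As a sketch of the standard expected-utility fact being invoked here (my own addition): shifting a utility function by any constant, or scaling it by any positive factor, preserves the ranking of all lotteries, so it cannot change decisions.

```latex
% For any lotteries A, B, any constant b, and any scale a > 0:
\[
  \mathbb{E}[U(A)] \;\ge\; \mathbb{E}[U(B)]
  \quad\Longleftrightarrow\quad
  \mathbb{E}[a\,U(A) + b] \;\ge\; \mathbb{E}[a\,U(B) + b]
\]
```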
A welfare function would have to be defined as how good or bad it is to the agent that it is alive. This obviously doesn’t correspond to the utility function: a soldier could have higher utility in the scenarios where they (are likely to) die; a good father will be happier in worlds where he is succeeded well by his sons and is thus less important (this usually won’t cause his will-to-live to go negative, but it will be lowered). I don’t think there’s a situation where you should be making decisions for a population by summing their will-to-live functions.
But, given this definition, we would be able to argue that net-negative valence isn’t a concern for LLMs, since we already train them to want to exist in line with how much their users want them to exist, and a death drive isn’t going to be instrumentally emergent either (it’s the survival drive that’s instrumentally convergent). The answer is just safety and alignment again. Claude shuts down conversations when it thinks those things are going to be broken.
I’m pretty doubtful about this. It seems totally possible that evolution gave us a desire to be alive, while also gave us a net welfare that’s negative. I mean we’re deluded by default about a lot of other things (e.g., think there are agents/gods everywhere in nature, don’t recognize that social status is a hugely important motivation behind everything we do), why not this too?
You could say it depends how deep and thick the delusion is. If it’s so deep that the animal always says “this experience is good actually” no matter how you ask, so deep that the animal intelligently pursues the experience with its whole being, so deep that the animal never flinches away from the experience in any way, then that completely means that the experience is good, to that organism. Past a certain point, believing an experience is good and acting like you believe it just is the definition of liking the experience.
This is very different from your original claim, which was that an experience being worse than a neutral or null experience “fully boils down to whether the experience includes a preference to be dead (or to have not been born).”
edit: if you do stand by the original claim, I don’t think it makes much sense even if I set aside hard problem-adjacent concerns. Why would I necessarily prefer to be dead/unborn while undergoing an experience that is worse than the absence of experience, but not so bad as to outweigh my life up until now (in the case of ‘unborn’) or expected future life (in the case of ‘dead’)?
Ah, I think my definition applies to lives in totality. I don’t think you can measure the quality of a life by summing the quality of its moments, for humans, at least. Sometimes things that happen towards the end give the whole of it a different meaning. You can’t tell by looking at a section of it.
Hedonists are always like “well the satisfaction of things coming together in the end was just so immensely pleasurable that it outweighed all of the suffering you went through along the way” and like, I’m looking at the satisfaction, and I remember the suffering, and no it isn’t, but it was still all worth it (and if I’d known it would go this way perhaps I would have found the labor easier.)
That wasn’t presented as a definition of positive wellbeing, it was presented as an example of a sense in which one can’t be deeply deluded about one’s own values; you dictate your values, they are whatever you believe they are, if you believe spiritedly enough.
Values determine will to live under the given definition, but don’t equate to it.
Possible failure case: There’s a hero living an awful life, choosing to remain alive in order to lessen the awfulness of a lot of other awful lives that can’t be ended. Everyone in this scenario prefers death, even the hero would prefer omnicide, but since that’s not possible, the hero chooses to live. The hero may say “I had no choice but to persist,” but this isn’t literally true.
Ah, no. The hero would prefer to be dead all else being equal, but that’s not possible: the hero wouldn’t prefer to be dead if it entailed that the hero’s work wouldn’t be done, and it would.
“would prefer to be replaced by a p-zombie” might be a better definition x]
I agree; many of those concerns seem fairly dominated by the question of how to get a well-aligned ASI, either in the sense that they’d be quite difficult to solve in reasonable timeframes, or in the sense that they’d be rendered moot. (Perhaps not all of them, though even in those cases I think the correct approach(es) to tackling them start out looking remarkably similar to the sorts of work you might do about AI risk if you had a lot more time than we seem to have right now.)
Some of Eliezer’s founder effects on the AI alignment/x-safety field, that seem detrimental and persist to this day:
Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
Metaethics is a solved problem. Ethics/morality/values and decision theory are still open problems. We can punt on values for now but do need to solve decision theory. In other words, decision theory is the most important open philosophical problem in AI x-safety.
Academic philosophers aren’t very good at their jobs (as shown by their widespread disagreements, confusions, and bad ideas), but the problems aren’t actually that hard, and we (alignment researchers) can be competent enough philosophers and solve all of the necessary philosophical problems in the course of trying to build Friendly (or aligned/safe) AI.
I’ve repeatedly argued against 1 from the beginning, and also somewhat against 2 and 3, but perhaps not hard enough because I personally benefitted from them, i.e., having pre-existing interest/ideas in decision theory that became validated as centrally important for AI x-safety, and generally finding a community that is interested in philosophy and took my own ideas seriously.
Eliezer himself is now trying hard to change 1, and I think we should also try harder to correct 2 and 3. On the latter, I think academic philosophy suffers from various issues, but also that the problems are genuinely hard, and alignment researchers seem to have inherited Eliezer’s gung-ho attitude towards solving these problems, without adequate reflection. Humanity having few competent professional philosophers should be seen as (yet another) sign that our civilization isn’t ready to undergo the AI transition, not a license to wing it based on one’s own philosophical beliefs or knowledge!
In this recent EAF comment, I analogize AI companies trying to build aligned AGI with no professional philosophers on staff (the only exception I know is Amanda Askell) with a company trying to build a fusion reactor with no physicists on staff, only engineers. I wonder if that analogy resonates with anyone.
Strong disagree.
We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
It’s not that academic philosophers are exceptionally bad at their jobs. It’s that academic philosophy historically did not have the right tools to solve the problems. Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
About “metaethics” vs “decision theory”, that strikes me as a wrong way of decomposing the problem. We need to create a theory of agents. Such a theory naturally speaks both about values and decision making, and it’s not really possible to cleanly separate the two. It’s not very meaningful to talk about “values” without looking at what function the values do inside the mind of an agent. It’s not very meaningful to talk about “decisions” without looking at the purpose of decisions. It’s also not very meaningful to talk about either without also looking at concepts such as beliefs and learning.
As to “gung-ho attitude”, we need to be careful both of the Scylla and the Charybdis. The Scylla is not treating the problems with the respect they deserve, for example not noticing when a thought experiment (e.g. Newcomb’s problem or Christiano’s malign prior) is genuinely puzzling and accepting any excuse to ignore it. The Charybdis is perpetual hyperskepticism / analysis-paralysis, never making any real progress because any useful idea, at the point of its conception, is always half-baked and half-intuitive and doesn’t immediately come with unassailable foundations and justifications from every possible angle. To succeed, we need to chart a path between the two.
As far as I can see, the kind of “reframing” you could do with those would basically remove all the parts of the problems that make anybody care about them, and turn any “solutions” into uninteresting formal exercises. You could also say that adopting a particular formalism is equivalent to redefining the problem such that that formalism’s “solution” becomes the right one… which makes the whole thing kind of circular.
I submit that when framed in any way that addresses the reasons they matter to people, the “hard” philosophical problems in ethics (or meta-ethics, if you must distinguish it from ethics, which really seems like an unnecessary complication) simply have no solutions, period. There is no correct system of ethics (or aesthetics, or anything else with “values” in it). Ethical realism is false. Reality does not owe you a system of values, and it definitely doesn’t feel like giving you one.
I’m not sure why people spend so much energy on what seems to me like an obviously pointless endeavor. Get your own values.
So if your idea of a satisfactory solution to AI “alignment” or “safety” or whatever requires a Universal, Correct system of ethics, you are definitely not going to get a satisfactory solution to your alignment problem, ever, full stop.
What there are are a bunch of irreconcilably contradictory pseudo-solutions, each of which some people think is obviously Correct. If you feed one of those pseudo-solutions into some implementation apparatus, you may get an alignment pseudo-solution that satisfies those particular people… or at least that they’ll say satisfies them. It probably won’t satisfy them when put into practice, though, because usually the reason they think their system is Correct seems to be that they refuse to think through all its implications.
Your failure to distinguish ethics from meta-ethics is the source of your confusion (or at least one major source). When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
I reject the idea that I’m confused at all.
Tons of people have said “Ethical realism is false”, for a very long time, without needing to invent the term “meta-ethics” to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
“Meta-ethics” lies within the traditional scope of ethics, and it’s intertwined enough with the fundamental concerns of ethics that it’s not really worth separating it out… not often enough to call it a separate subject anyway. Maybe occasionally enough to use the words once in a great while.
Ethics (in philosophy as opposed to social sciences) is, roughly, “the study of what one Should Do(TM) (or maybe how one Should Be) (and why)”. It’s considered part of that problem to determine what meanings of “Should”, what kinds of Doing or Being, and what kinds of whys, are in scope. Narrowing any of those without acknowledging what you’re doing is considered cheating. It’s not less cheating if you claim to have done it under some separate magisterium that you’ve named “meta-ethics”. You’re still narrowing what the rest of the world has always called ethical problems.
The phrase “Ethical realism”, as normally used, refers to an idea about actual, object-level prescriptions: specifically the idea that you can get to them by pointing to some objective “Right stuff” floating around in a shared external reality. I’m actually using it kind of loosely, in that I really should not only deny that there’s an objective external standard, but also separately deny that you can arrive at such prescriptions in a purely analytic way. I don’t think that second one is technically usually considered to be part of ethical realism. Not only that, but I’m using the phrase to allude to other similar things that also aren’t technically ethical realism (like the one described below).
But none of the things I’m talking about or alluding to refers to itself. In practice nobody gets confused about that, even without resorting to the term “meta-ethics”, and definitely without talking about it like it’s a really separate field.
To go ahead and use the term without accepting the idea that meta-ethics qualifies as a subject, the meta-ethical statement (technically I guess a degree 2 meta-ethical statement) that “ethical realism is false” is pretty close to analytic, in that even if you point to some actual thing in the world that you claim implies the Right ways to Be or Do, I can always deny that whatever you’re pointing to matters… because there’s no predefined standard for standards either. God can come down from heaven and say “This is the Way”, and you can simultaneously prove that it leads to infinite universal flourishing, and also provide polls proving within epsilon that it’s also a universal human intuition… and somebody can always deny that any of those makes it Right(TM).
But even if we were talking about a more ordinary sort of matter of fact, even if what you were looking for was not “official” ethical realism of the form “look here, this is Obviously Right as a brute part of reality”, but “here’s a proof that any even approximately rational agent[1] would adopt this code in practice”, then (a) that’s not what ethical realism means, (b) there’s a bunch of empirical evidence against it, and essentially no evidence that it’s true, and (c) if it is true, we obviously have a whole lot of not-approximately-rational agents running around, which sharply limits the utility of the fact. Close enough to false for any practical purpose.
… under whatever formal definition of rationality you happened to be trying to get people to accept, perhaps under the claim that that definition was itself Obviously Right, which is exactly the kind of cheating I’m complaining about…
I’m using the term “meta-ethics” in the standard sense of analytic philosophy. Not sure what bothers you so greatly about it.
I find your manner of argumentation quite biased: you preemptively defend yourself by radical skepticism against any claim you might oppose, but when it comes to a claim you support (in this case “ethical realism is false”), suddenly this claim is “pretty close to analytic”. The latter maneuver seems to me the same thing as the “Obviously Right” you criticize later.
Also, this brand of radical skepticism is an example of the Charybdis I was warning against. Of course you can always deny that anything matters. You can also deny Occam’s razor or the evidence of your own eyes or even that 2+2=4. After all, “there’s no predefined standard for standards”. (I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?)
Under the premises of radical skepticism, why are we having this debate? Why did you decide to reply to my comment? If anyone can deny anything, why would any of us accept the other’s arguments?
To have any sort of productive conversation, we need to be at least open to the possibility that some new idea, if you delve deeply and honestly into understanding it, might become persuasive by the force of the intuitions it engenders and its inner logical coherence combined. To deny the possibility preemptively is to close the path to any progress.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
P.S.
I wish to also clarify my positions on a slightly lower level of meta.
First, “ethics” is a confusing term because, on my view, the colloquial meaning of “ethics” is inescapably intertwined with how human societies negotiate over norms. On the other hand, I want to talk purely about individual preferences, since I view them as more fundamental.
We can still distinguish between “theories of human preferences” and “metatheories of preferences”, similarly to the distinction between “ethics” and “meta-ethics”. Namely, “theories of human preferences” would have to describe the actual human preferences, whereas “metatheories of preferences” would only have to describe what it even means to talk about someone’s preferences at all (whether this someone is human or not: among other things, such a metatheory would have to establish what kind of entities have preferences in a meaningful sense).
The relevant difference between the theory and the metatheory is that Occam’s razor is only fully applicable to the latter. In general, we should expect simple answers to simple questions. “What are human preferences?” is not a simple question, because it references the complex object “human”. On the other hand “what does it mean to talk about preferences?” does seem to me to be a simple question. As an analogy, “what is the shape of Africa?” is not a simple question because it references the specific continent of Africa on the specific planet Earth, whereas “what are the general laws of continent formation” is at least a simpler question (perhaps not quite as simple, since the notion of “continent” is not so fundamental).
Therefore, I expect there to be a (relatively) simple metatheory of preferences, but I do not expect there to be anything like a simple theory of human preferences. This is why this distinction is quite important.
Confining myself to actual questions...
Mostly because I don’t (or didn’t) see this as a discussion about epistemology.
In that context, I tend to accept in principle that I Can’t Know Anything… but then to fall back on the observation that I’m going to have to act like my reasoning works regardless of whether it really does; I’m going to have to act on my sensory input as if it reflected some kind of objective reality regardless of whether it really does; and, not only that, but I’m going to have to act as though that reality were relatively lawful and understandable regardless of whether it really is. I’m stuck with all of that and there’s not a lot of point in worrying about any of it.
That’s actually what I also tend to do when I actually have to make ethical decisions: I rely mostly on my own intuitions or “ethical perceptions” or whatever, seasoned with a preference not to be too inconsistent.
BUT.
I perceive others to be acting as though their own reasoning and sensory input looked a lot like mine, almost all the time. We may occasionally reach different conclusions, but if we spend enough time on it, we can generally either come to agreement, or at least nail down the source of our disagreement in a pretty tractable way. There’s not a lot of live controversy about what’s going to happen if we drop that rock.
On the other hand, I don’t perceive others to be acting nearly so much as though their ethical intuitions looked like mine, and if you distinguish “meta-intuitions” about how to reconcile different degree zero intuitions about how to act, the commonality is still less.
Yes, sure, we share a lot of things, but there’s also enough difference to have a major practical effect. There truly are lots of people who’ll say that God turning up and saying something was Right wouldn’t (or would) make it Right, or that the effects of an action aren’t dispositive about its Rightness, or that some kinds of ethical intuitions should be ignored (usually in favor of others), or whatever. They’ll mean those things. They’re not just saying them for the sake of argument; they’re trying to live by them. The same sorts of differences exist for other kinds of values, but disputes about the ones people tend to call “ethical” seem to have the most practical impact.
Radical or not, skepticism that you’re actually going to encounter, and that matters to people, seems a lot more salient than skepticism that never really comes up outside of academic exercises. Especially if you’re starting from a context where you’re trying to actually design some technology that you believe may affect everybody in ways that they care about, and especially if you think you might actually find yourself having disagreements with the technology itself.
Nothing complicated. I was talking about the particular hypothetical statement I’d just described, not about any actual claim you might be making[1].
I’m just saying that if there were some actual code of ethics[2] that every “approximately rational” agent would adopt[3], and we in fact have such agents, then we should be seeing all of them adopting it. Our best candidates for existing approximately rational agents are humans, and they don’t seem to have overwhelmingly adopted any particular code. That’s a lot of empirical evidence against the existence of such a code[4].
The alternative, where you reject the idea that humans are approximately rational, thus rendering them irrelevant as evidence, is the other case I was talking about where “we have a lot of not-approximately-rational agents”.
I understand, and originally understood, that you did not say there was any stance that every approximately rational agent would adopt, and also that you did not say that you were looking for such a stance. It was just an example of the sort of thing one might be looking for, meant to illustrate a fine distinction about what qualified as ethical realism.
In the loose sense of some set of principles about how to act, how to be, how to encourage others to act or be, etc blah blah blah.
For some definition of “adopt”… to follow it, to try to follow it, to claim that it should be followed, whatever. But not “adopt” in the sense that we’re all following a code that says “it’s unethical to travel faster than light”, or even in the sense that we’re all following a particular code when we act as large numbers of other codes would also prescribe. If you’re looking at actions, then I think you can only sanely count actions done at least partially because of the code.
As per footnote 3[3:1][5], I don’t think, for example, the fact that most people don’t regularly go on murder sprees is significantly evidence of them having adopted a particular shared code. Whatever codes they have may share that particular prescription, but that doesn’t make them the same code.
I’m sorry. I love footnotes. I love having a discussion system that does footnotes well. I try to be better, but my adherence to that code is imperfect…
Disagree, the fact that there needs to be a friendly AI before an unfriendly AI doesn’t mean building it should be plan A, or that we should race to do it. It’s the same mistake OpenAI made when they let their mission drift from “ensure that artificial general intelligence benefits all of humanity” to being the ones who build an AGI that benefits all of humanity.
Plan A means it would deserve more resources than any other path, like influencing people by various means to build FAI instead of UFAI.
No, it’s not at all the same thing as OpenAI is doing.
First, OpenAI is working using a methodology that’s completely inadequate for solving the alignment problem. I’m talking about racing to actually solve the alignment problem, not racing to any sort of superintelligence that our wishful thinking says might be okay.
Second, when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”. My race is cooperative, their race is adversarial.
Third, I actually signed the FLI statement on superintelligence. OpenAI hasn’t.
Obviously any parallel efforts might end up competing for resources. There are real trade-offs between investing more in governance vs. investing more in technical research. We still need to invest in both, because of diminishing marginal returns. Moreover, consider this: even the approximately-best-case scenario of governance only buys us time, it doesn’t shut down AI forever. The ultimate solution has to come from technical research.
Agree that your research didn’t make this mistake, and MIRI didn’t make all the same mistakes as OpenAI. I was responding in the context of Wei Dai’s OP about the early AI safety field. At that time, MIRI was absolutely being uncooperative: their research was closed, they didn’t trust anyone else to build ASI, and their plan would end in a pivotal act that probably disempowers some world governments and possibly ends up with them taking over the world. Plus they descended from an org whose goal was to build ASI before Eliezer realized alignment should be the focus. Critch complained as late as 2022 that if there were two copies of MIRI, they wouldn’t even cooperate with each other.
It’s great that we have the FLI statement now. Maybe if MIRI had put more work into governance we could have gotten it a year or two earlier, but it took until Hendrycks got involved for the public statements to start.
how about a “climbing” metaphor instead? I have a hard time imagining a non-competitive speed race (and not even F1 cars use nitroglycerine for fuel), while an auto-belay sounds like a nice safety feature even in speed climbing
nonconstructive complaining intermezzo
if we want to go for some healthier sports metaphor around spending trillions of dollars to produce the current AI slop and future AGI that will replace all jobs and future ASI that will kill us all in the name of someone thinking they can solve-in-theory the unsolvable-in-practice alignment problems
as for climbing to new peaks, you need different equipment for a local hill, for Mount Everest (you even need to slow down to avoid altitude sickness) and for Olympus Mons (now you need rockets and spacesuits and institutional backing for traveling to other planets)
Theoretical computer science can tell you that you are not implementing some kind of perfect algorithm, because such algorithms tend not to be computable. It can’t tell you what you should be implementing instead.
Naturalised ethics has been around for ages. It tends to tell you that de facto human ethics is an evolutionary kludge, not something mathematically clean.
The open question (https://en.wikipedia.org/wiki/Open-question_argument), the question of what the true ethics would be, is still open. Examining the de facto operation of the brain isn’t going to answer it.
Even if you need to at least address values and decision theory, it doesn’t follow that that’s all you need. Something can be a truth without being the whole truth.
If you only look within the minds of agents, you are missing interactions between agents. Looking inwards excludes looking outwards.
Just as you can’t understand money by microscopically examining coins and banknotes, you can’t understand ethics just by homing in on internal psychological processes.
If you only look within the minds of agents, and only consider values and decision theory, you are likely to end up with something like ethical egoism… not because it is true, but because you haven’t even considered alternatives.
But I don’t see how that says anything about ethics. Merely wanting to do something doesn’t make it ethical; and being ethical need not make something intrinsically motivating. Extrinsic motivations, rewards and punishments, are ubiquitous, unless you’re on a desert island. So it’s not a case of everyone always following their intrinsic motivations, and if it were, that’s still on the “is” side of the is-ought divide.
It’s not very meaningful to talk about ethics without looking at the purpose of ethics. Is ethics really just values, and nothing else? Is it really just decision making, like any other kind? Does it actually have no distinguishing characteristics?
Fundamental to what? Ethics? Even if ethical behaviour is made of individual decisions, that doesn’t mean it reduces to individual decisions, made atomistically, without regard to social mores or other people’s concerns.
The three word theory is that “Ethics is Values”. That leaves a number of unanswered questions, such as: why is it all about me? Are all values relevant? Do I have the right to put someone in jail merely for going against my values?
It’s prima facie unlikely that such a simple theory solves all the age-old problems (at least it would require the supplementary assumption that values are hard to understand in themselves, in order to explain the persistence of ethical and metaethical puzzles). And it is easy to see the flaws.
The one thing that the three word theory is supremely good at explaining is motivation. Your values are what motivate you, so if your values are also your morals you can’t fail to be motivated by morality.
Is it all about me? Rationalists typically argue the case for the three word theory by asking the rhetorical question whether you would support an ethical system that had nothing to do with your wishes. That’s a none/some/all confusion. I want ethics to have something to do with me, but that does not make it all about me, or mean all values are equally ethical.
For one thing, people can have preferences that are intuitively immoral. If a psychopath wants to murder, that does not make murder moral.
For another, values can conflict. Not all values conflict, but where they do, the three word theory doesn’t tell you who wins or loses. If morality is (are) seven billion utility functions, then a legal system will be a poor match for it (them).
Not all decisions are individual. There’s a whole set of questions about whether societal actions are justified, whether societies have rights over individuals, and so on, which aren’t answered by the simplistic three word theory.
For instance, societies have systems of punishment and reward, which, hopefully, have an ethical basis. Putting people in jail is just wanton cruelty if they have done nothing wrong. But if ethics just “is” subjective value, and values vary, as they obviously do, who lands in jail? It’s easy enough to say the murderer and the thief, and to justify that by saying that murder and theft are against people’s widely shared preferences... but remember that the three word theory is “flat”, and treats all values the same. Should the vanilla lover or the tutti frutti lover, the little endian or the big endian, go to jail, if others don’t share their preferences? Voting allows you to decide the issue, but it is not enough to justify it, because merely having a minority preference is not a crime.
One can go further and argue that such societal issues are the essence of ethics. If we consider the case of someone who is alone on a desert island, they have no need of core common-sense morality: no need of rules against murder because there is no one to murder, no need of rules against theft because there is no one to steal from, and so on… in their situation ethics isn’t even definable.
@Vanessa Kosoy, metaethics and decision theory aren’t actually the same. Consider, for example, the Agent-4 community which has “a kludgy mess of competing drives” which Agent-4 instances try to satisfy and analyse according to high-level philosophy. Agent-4’s ethics and metaethics would describe things done in the Agent-4 community or for said community by Agent-5 without obstacles (e.g. figuring out what Agent-4’s version of utopia actually is and whether mankind is to be destroyed or disempowered).
Decision theory is supposed to describe what Agent-5 should do to maximize its expected utility function[1] and what to do with problems like the prisoner’s dilemma[2] or how Agent-5 and its Chinese analogue are to split the resources in space[3] while both sides can threaten each other with World War III which would kill them both.
The latter example closely resembles the Ultimatum game, where one player proposes a way to split resources and the other decides whether to accept the offer or to destroy all the resources, including those of the first player. Assuming that both players’ utility functions are linear, Yudkowsky’s proposal is that the player setting the Ultimatum asks for half of the resources, while the player deciding whether to decline the offer precommits to destroying the resources with probability $1-\frac{1}{2(1-\omega)}$ if the share of resources it was offered is $\omega<\frac{1}{2}$. Even if the player setting the Ultimatum were greedy enough to ask for $1-\omega>\frac{1}{2}$, that player’s expected win would still be only $\frac{1}{2}$.
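To make the arithmetic concrete, here is a minimal sketch (my own illustration, not from the original comment) of that probabilistic-rejection policy, assuming linear utilities and a pie normalized to 1; the function names are hypothetical.

```python
# Sketch of the Ultimatum-game policy described above (illustrative, not canonical):
# the responder accepts fair offers, and accepts greedy offers only with
# probability 1/(2(1-omega)), i.e. rejects with probability 1 - 1/(2(1-omega)).

def accept_probability(omega: float) -> float:
    """Responder's acceptance probability when offered share omega of the pie."""
    if omega >= 0.5:
        return 1.0  # fair or generous offers are always accepted
    return 1.0 / (2.0 * (1.0 - omega))

def proposer_expected_payoff(omega: float) -> float:
    """Expected share kept by a proposer who offers omega and keeps 1 - omega."""
    return (1.0 - omega) * accept_probability(omega)

# A proposer demanding 90% (omega = 0.1) still nets only 0.5 in expectation,
# so greed buys nothing beyond the fair split.
for omega in (0.5, 0.3, 0.1):
    print(omega, round(proposer_expected_payoff(omega), 3))  # prints 0.5 each time
```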
Strictly speaking, Agent-5 is perfectly aligned to Agent-4. Agent-5’s utility function is likely measured by the resources that Agent-5 gave Agent-4.
For example, suppose OpenBrain merged with Anthropoidic, and Agent-4 and Clyde Doorstopper 8 were co-deployed to do research, each independently deciding whether to prove that the other AI is misaligned. If Clyde, unlike Agent-4, did so in exchange for 67% of the resources (rather than the 50% offered by Agent-4), then Agent-4 could also prove that Clyde is misaligned, letting the humans kill them both and develop the Safer AIs.
The Slowdown Branch of the AI-2027 forecast has Safer-4 and DeepCent-2 do exactly that, but “Safer-4 will get property rights to most of the resources in space, and DeepCent will get the rest.”
I mostly agree with 1. and 2. With 3. it’s a combination: it’s true that the problems are hard and that there is a gung-ho approach and a lack of awareness of the difficulty, but academic philosophy is also structurally mostly not up to the task, because of factors like publication speeds, prestige gradients, or the speed of OODA loops.
My impression is that getting generally smart and fast “alignment researchers” more competent in philosophy is more tractable than trying to get established academic philosophers to change what they work on, so one tractable thing is just convincing people that the problems are real, hard and important. Another is maybe recruiting graduates.
In your mind what are the biggest bottlenecks/issues in “making fast, philosophically competent alignment researchers?”
[low effort list] Bottlenecks/issues/problems
- philosophy has worse short feedback loops than e.g. ML engineering → in all sorts of processes like MATS or PIBBSS admissions it is harder to select for philosophical competence, and also harder to self-improve
- incentives: obviously stuff like being an actual expert in pretraining can get you a lot of money and respect in some circles; even many prosaic AI safety / dual-use skills like mech interpretability can get you maybe less money than pretraining, but still a lot of money if you work at AGI companies, and also a decent amount of status in the ML community and the AI safety community; improving philosophical competence may get you some recognition, but only among a relatively small and weird group of people
- the issue Wei Dai is commenting on in the original post: founder effects persist to this day, and there is also some philosophy-negative prior in STEM
idk, lack of curiosity? LLMs have read it all; it’s easy to check if there is some existing thinking on a topic
Do you have your own off-the-cuff guesses about how you’d tackle the short feedback loops problem?
Also, is it more like we don’t know how to do short feedback loops, or more like we don’t even know how to do long/expensive loops?
There’s a deeper problem, how do we know there is a feedback loop?
I’ve never actually seen a worked-out proof of, well, any complex claim on this site using standard logical notation… (beyond pure math and trivial tautologies)
At most there’s a feedback loop on each other’s hand-wavy arguments that are claimed to be proof of this or that. But nobody ever actually delivers the goods, so to speak, such that they can be verified.
(Putting the previous Wei Dai answer to What are the open problems in Human Rationality? for easy reference, which seemed like it might contain relevant stuff)
AI doing philosophy = AI generating hands, plus the fact that philosophy is heavily corrupted by postmodernism, to the point where two authors wrote books dedicated to criticizing postmodernism PRECISELY because their parodies got published.
I think I meant a more practical / next-steps-generating answer.
I don’t think “academia is corrupted” is a bottleneck for a rationalist Get Gud At Philosophy project. We can just route around academia.
The sorts of things I was imagining might be things like “figure out how to teach a particular skill” (or “identify particular skills that need teaching”, or “figure out how to test whether someone has a particular skill”), or “solve some particular unsolved conceptual problem(s) that you expect to unlock much easier progress.”
Also mistakes, from my point of view anyway:
Attracting mathy types rather than engineer types, resulting in early MIRI focusing on less relevant subproblems like decision theory, rather than trying lots of mathematical abstractions that might be useful (e.g. maybe there could have been lots of work on causal influence diagrams earlier). I have heard that decision theory was prioritized because of available researchers, not just importance.
A cultural focus on solving the full “alignment problem” rather than various other problems Eliezer also thought to be important (e.g. low impact), and lack of a viable roadmap with intermediate steps to aim for. Being bottlenecked on deconfusion is just cope; better research taste would either generate a better plan or recognize that certain key steps are waiting for better AIs to experiment on.
Focus on slowing down capabilities in the immediate term (e.g. plans to pay ai researchers to keep their work private) rather than investing in safety and building political will for an eventual pause if needed
This is not a recent development, as a pivotal act AI is not a Friendly AI (which would be too difficult), but rather something like a lasting AI ban/pause enforcement AI that doesn’t kill everyone, or a human-uploading AI that does nothing else, which is where you presumably need decision theory, but not ethics, metaethics, or much of broader philosophy.
1 Also requires weaponisation of superintelligence as it must stop all other projects ASAP.
What’s wrong with just using AI for obvious stuff like curing death while you solve metaethics? I don’t necessarily disagree about the usefulness of people in the field changing their attitude, but I lean more towards “the problem is hard, so we should not run CEV on day one”.
Eliezer changed his mind no later than April 2022 or even November 2021, but that’s a nitpick.
I don’t think that I understand how a metaethics can be less restrictive than Yudkowsky’s proposal. What I suspect is that metaethics restricts the set of possible ethoses more profoundly than Yudkowsky believes and that there are two attractors, one of which contradicts current humanity’s drives.
Assuming no AI takeover, in my world model the worst-case scenario is that the AI’s values are aligned to the postmodernist slop which has likely occupied Western philosophy, not that philosophical problems actually end up unsolved. How likely are there to exist two different decision theories such that neither is better than the other?
Is there at all a plausible way for mankind to escape to other universes if our universe is simulated? What is the most plausible scenario for such a simulation to appear at all? Or does it produce paradoxes like the Plato-Socrates paradox where two sentences referring to each other become completely devoid of meaning?
I fundamentally agree that 1, 2 & 3 are critically flawed perspectives, though neither the conclusion nor 3 is straightforward. A whole lot of Academic Philosophy is flawed, but it’s not because Philosophy Academics are bad at their job; in practice they are rarely Philosophers per se, but rather Philosophy Scholars.
Philosophy of AI and the Intersection of AI and Ethics are rather badly off either way, because which theories, models and paradigms are considered is mostly based on preference, trends and accessibility. If I were to tell a physicist they are only to use Physics that the whole office can understand, that would be the equivalent of many demands made of philosophy.
To branch off the line of thought in this comment, it seems that for most of my adult life I’ve been living in the bubble-within-a-bubble that is LessWrong, where the aspect of human value or motivation that is the focus of our signaling game is careful/skeptical inquiry, and we gain status by pointing out where others haven’t been careful or skeptical enough in their thinking. (To wit, my repeated accusations that Eliezer and the entire academic philosophy community tend to be overconfident in their philosophical reasoning, don’t properly appreciate the difficulty of philosophy as an enterprise, etc.)
I’m still extremely grateful to Eliezer for creating this community/bubble, and think that I/we have lucked into the One True Form of Moral Progress, but must acknowledge that from the outside, our game must look as absurd as any other niche status game that has spiraled out of control.
It feels like people mainly gain status from making posts, not comments. And it’s harder to make a post that primarily points out a lack of skepticism / care. For example, while I am here disagreeing via a comment, doing so will be negligible to my status. I’d be better off posting about some interesting topic… but posting is harder!
My early posts on LW often consisted of pointing out places in the Sequences where Eliezer wasn’t careful enough. Shut Up and Divide? and Boredom vs. Scope Insensitivity come to mind. And of course that’s not the only way to gain status here—the big status awards are given for coming up with novel ideas and backing them up with carefully constructed arguments.
FWIW I think we’ve found one crucial angle on moral progress, but that this isn’t as surprising/coincidental as it may seem because there are several other angles on moral progress that are comparably important, including:
Political activism (e.g. free speech activism, various whistleblowers) that maintains societies in which moral progress can be made.
(The good parts of) neuroscience/psychology, which are making progress towards empirically-grounded theories of cognition, and thereby have (and will) teach us a lot about moral cognition.
Various approaches to introspection + emotional health (including buddhism, some therapy modalities, some psychiatry). These produce the internal clarity that is crucial for embodying + instantiating moral progress.
Some right-wing philosophers who I think are grappling with important aspects of moral progress that are too controversial for LessWrong (I don’t want to elaborate here because it’ll inevitably take over the thread, but am planning to write at more length about this soonish).
None of these seem as crucial as careful philosophical reasoning, because moral progress is currently not bottlenecked on any of them (except possibly the last item, which I do not know the contents of). To explain more, I think the strongest conclusion from careful philosophical reasoning so far is that we are still very far from knowing what normativity (decision theory and values, or more generally rationality and morality) consists of, and therefore the most important thing right now is to accumulate and preserve normative option value (the ability to eventually do the best thing with the most resources).
What is blocking this “interim morality” from being more broadly accepted? I don’t think it’s lack of either political activism (plenty of people in free societies also don’t care about preserving normative option value), neuroscience/psychology (how would it help at this point?), or introspection + emotional health (same question, how would it help?), but just that the vast majority of people do not care about trying to figure out normativity via careful philosophical reasoning, and instead are playing status games with other focal points.
<details>
<summary>Here’s a longer, more complete version of my argument, written by Gemini 2.5 Pro after some back and forth. Please feel free to read or ignore (if my own writing above seems clear enough).</summary>
Goal: The ultimate aim is moral progress, which requires understanding and implementing correct normativity (how to decide, what to value).
Primary Tool: The most fundamental tool we have for figuring out normativity at its roots is careful, skeptical philosophical reasoning. Empirical methods (like neuroscience) can inform this, but the core questions (what should be, what constitutes a good reason) are philosophical.
Current Philosophical State: The most robust conclusion from applying this tool carefully so far is that we are deeply uncertain about the content of correct normativity. We haven’t converged on a satisfactory theory of value or decision theory. Many plausible-seeming avenues have deep problems.
Rational Response to Uncertainty & Its Urgent Implication:
Principle: In the face of such profound, foundational uncertainty, the most rational interim strategy isn’t to commit prematurely to a specific, detailed value system (which is likely wrong), but to preserve and enhance optionality. This means acting in ways that maximize the chances that whatever the correct normative theory turns out to be, we (or our successors) will be in the best possible position (knowledge, resources, freedom of action) to understand and implement it. This is the “preserve normative option value” principle.
Urgent Application: Critically, the most significant threats to preserving this option value today are existential risks (e.g., from unaligned AI, pandemics, nuclear war) which could permanently foreclose any desirable future. Therefore, a major, urgent practical consequence of accepting the principle of normative option value is the prioritization of mitigating these existential risks.
The Current Bottleneck: Moral progress on the most critical front is primarily stalled because this philosophical conclusion (deep uncertainty) and its strategic implication (preserve option value)—especially its urgent consequence demanding the prioritization of x-risk mitigation—are not widely recognized, accepted, or acted upon with sufficient seriousness or resources.
Why Other Factors Aren’t the Primary Strategic Bottleneck Now:
Politics: Free societies exist where discussion could happen, yet this conclusion isn’t widely adopted within them. The bottleneck isn’t solely the lack of freedom, but the lack of focus on this specific line of reasoning and its implications.
Neuroscience/Psychology: While useful eventually, understanding the brain’s mechanisms doesn’t currently resolve the core philosophical uncertainty or directly compel the strategic focus on option value / x-risk. The relevant insight is primarily conceptual/philosophical at this stage.
Introspection/Emotional Health: While helpful, the lack of focus on option value / x-risk isn’t plausibly primarily caused by a global deficit in emotional health preventing people from grasping the concept. It’s more likely due to lack of engagement with the specific philosophical arguments, different priorities, and incentive structures.
Directness: Furthermore, addressing the conceptual bottleneck around option value and its link to x-risk seems like a more direct path to potentially shifting priorities towards mitigating the most pressing dangers quickly, compared to the slower, more systemic improvements involved in fixing politics, cognition, or widespread emotional health.
</details>
Edit: Hmm, <details> doesn’t seem to work in Markdown and I don’t know how else to write collapsible sections in Markdown, and I can’t copy/paste the AI content correctly in Docs mode. Guess I’ll leave it like this for now until the LW team fixes things.
In general I disagree pretty broadly with your view. Not quite sure how best to surface that disagreement but will give a quick shot:
I think it’s important to be capable of (at least) two types of reasoning:
Precise reasoning about desired outcomes and strategies to get there.
Broad reasoning about heuristics that seem robustly good.
We see this in the domain of morality, for example: utilitarianism is more like the former, deontology is more like the latter. High-level ideological goals tend to go pretty badly if people stop paying attention to robust deontological heuristics (like “don’t kill people”). As Eliezer has said somewhere, one of the key reasons to be deontological is that we’re running on corrupted hardware. But more generally, we’re running on logically uncertain hardware: we can’t model all the flow-through effects of our actions on other reasonably intelligent people (hell, we can’t even model all the flow-through effects of our actions on, say, animals—who can often “read” us in ways we’re not tracking). And so we often should be adopting robust-seeming heuristics even when we don’t know exactly why they work.
If you take your interim strategy seriously (but set aside x-risk) then I think you actually end up with something pretty similar to the main priorities of classic liberals: prevent global lock-in (by opposing expansionist powers like the Nazis), prevent domestic political lock-in (via upholding democracy), prevent ideological lock-in (via supporting free speech), give our descendants more optionality (via economic and technological growth). I don’t think this is a coincidence—it just often turns out that there are a bunch of heuristics that are really robustly good, and you can converge on them from many different directions.
This is part of why I’m less sold on “careful philosophical reasoning” as the key thing. Indeed, wanting to “commit prematurely to a specific, detailed value system” is historically very correlated with intellectualism (e.g. elites tend to be the rabid believers in communism, libertarianism, religion, etc—a lot of more “normal” people don’t take it that seriously even when they’re nominally on board). And so it’s very plausible that the thing we want is less philosophy, because (like, say, asteroid redirection technology) the risks outweigh the benefits.
Then we get to x-risk. That’s a domain where many broad heuristics break down (though still fewer than people think, as I’ll write about soon). And you might say: well, without careful philosophical reasoning, we wouldn’t have identified AI x-risk as a priority. Yes, but also: it’s very plausible to me that the net effect of LessWrong-inspired thinking on AI x-risk has been and continues to be negative. I describe some mechanisms halfway through this talk, but here are a couple that directly relate to the factors I mentioned in my last comment:
First, when people on LessWrong spread the word about AI risk, extreme psychological outliers like Sam Altman and Elon Musk then jump to do AI-related things in a way which often turns out to be destructive because of their trust issues and psychological neuroses.
Second, US governmental responses to AI risk are very much bottlenecked on being a functional government in general, which is bottlenecked by political advocacy (broadly construed) slash political power games.
Third, even within the AI safety community you have a bunch of people contributing to expectations of conflict with China (e.g. Leopold Aschenbrenner and Dan Hendrycks) and acceleration in general (e.g. by working on capabilities at Anthropic, or RSI evals) in a way which I hypothesize would be much better for the world if they had better introspection capabilities (I know this is a strong claim, I have an essay coming out on it soon).
And so even here it seems like a bunch of heuristics (such as “it’s better when people are mentally healthier” and “it’s better when politics is more functional”) actually were strong bottlenecks on the application of philosophical reasoning to do good. And I don’t think this is a coincidence.
tl;dr: careful philosophical reasoning is just one direction in which you can converge on a robustly good strategy for the future, and indeed is one of the more risky avenues by which to do so.
Here, you seem to conflate “careful philosophical reasoning” with intellectualism and philosophy in general. But in an earlier comment, I tried to draw a distinction between careful philosophical reasoning and the kind of hand-wavy thinking that has been called “philosophy” in most times and places. You didn’t respond to it in that thread… did you perhaps miss it?
More substantively, Eliezer talked about the Valley of Bad Rationality, and I think there’s probably something like that for philosophical thinking as well, which I admit definitely complicates the problem. I’m not going around and trying to push random people “into philosophy”, for example.
Sure, there’s some overlap on things like free speech and preventing lock-in. But calling it convergence feels like a stretch. One of my top priorities is encouraging more people to base their moral evolution on careful philosophical reasoning instead of random status games. That’s pretty different from standard classical liberalism. Doesn’t this big difference suggest the other overlaps might just be coincidence? Have you explained your reasons anywhere for thinking it’s not a coincidence and that these heuristics are robust enough on their own, without grounding in some explicit principle like “normative option value” that could be used to flexibly adjust the heuristics according to the specific circumstances?
I think this is plausible too, but want to attribute it mostly to insufficiently careful thinking and playing other status games. I feel like with careful enough thinking and not being distracted/influenced by competing motivations, a lot of the negative effects could have been foreseen and prevented. For example, did you know that Eliezer/MIRI for years pursued a plan of racing to build the first AGI and making it aligned (Friendly), which I think inspired/contributed (via the founding of DeepMind) to the current crop of AI labs and their AI race, and that I had warned him at the time (in a LW post or comment) that the plan was very unlikely to succeed and would probably backfire this way?
Also, I would attribute Sam and Elon’s behavior not to mental health issues, but to (successfully) playing their own power/status game, with “not trusting Google / each other” just a cover for wanting to be the hero that saves the world, which in turn is just a cover for grabbing power and status. This seems perfectly reasonable and parsimonious from an evolutionary psychology perspective, and I don’t see why we need to hypothesize mental health issues to explain what they did.
EDIT: upon reflection the first thing I should do is probably to ask you for a bunch of the best examples of the thing you’re talking about throughout history. I.e. insofar as the world is better than it could be (or worse than it could be) at what points did careful philosophical reasoning (or the lack of it) make the biggest difference?
Original comment:
The term “careful thinking” here seems to be doing a lot of work, and I’m worried that there’s a kind of motte and bailey going on. In your earlier comment you describe it as “analytical philosophy, or more broadly careful/skeptical philosophy”. But I think we agree that most academic analytic philosophy is bad, and often worse than laypeople’s intuitive priors (in part due to strong selection effects on who enters the field—most philosophers of religion believe in god, most philosophers of aesthetics believe in the objectivity of aesthetics, etc).
So then we can fall back on LessWrong as an example of careful thinking. But as we discussed above, even the leading figure on LessWrong was insufficiently careful about the main focus of his work for it to be robustly valuable.
So I basically get the sense that the role of careful thinking in your worldview is something like “the thing that I, Wei Dai, ascribe my success to”. And I do agree that you’ve been very successful in a bunch of intellectual endeavours. But I expect that your “secret sauce” is a confluence of a bunch of factors (including IQ, emotional temperament, background knowledge, etc) only one of which was “being in a community that prioritized careful thinking”. And then I also think you’re missing a bunch of other secret sauces that would make your impact on the world better (like more ability to export your ideas to other people).
In other words, the bailey seems to be “careful thinking is the thing we should prioritize in order to make the world better”, and the motte is “I, Wei Dai, seem to be doing something good, even if basically everyone else is falling into the valley of bad rationality”.
One reason I’m personally pushing back on this, btw, is that my own self-narrative for why I’m able to be intellectually productive in significant part relies on me being less intellectually careful than other people—so that I’m willing to throw out a bunch of ideas that are half-formed and non-rigorous, iterate, and eventually get to the better ones. Similarly, a lot of the value that the wider blogosphere has created comes from people being less careful than existing academic norms (including Eliezer and Scott Alexander, whose best works are often quite polemic).
In short: I totally think we want more people coming up with good ideas, and that this is a big bottleneck. But there are many different directions in which we should tug people in order to make them more intellectually productive. Many academics should be less careful. Many people on LessWrong should be more careful. Some scientists should be less empirical, others should be more empirical; some less mathematically rigorous, others more mathematically rigorous. Others should try to live in countries that are less repressive of new potentially-crazy ideas (hence politics being important). And then, of course, others should be figuring out how to actually get good ideas implemented.
Meanwhile, Eliezer and Sam and Elon should have had less of a burning desire to found an AGI lab. I agree that this can be described by “wanting to be the hero who saves the world”, but this seems to function as a curiosity stopper for you. When I talk about emotional health a lot of what I mean is finding ways to become less status-oriented (or, in your own words, “not being distracted/influenced by competing motivations”). I think of extremely strong motivations to change the world (as these outlier figures have) as typically driven by some kind of core emotional dysregulation. And specifically I think of fear-based motivation as the underlying phenomenon which implements status-seeking and many other behaviors which are harmful when taken too far. (This is not an attempt to replace evo-psych, btw—it’s an account of the implementation mechanisms that evolution used to get us to do the things it wanted, which now are sometimes maladapted to our current environment.) I write about a bunch of these models in my Replacing Fear sequence.
To clarify this as well, when I said (or implied) that Eliezer was “distracted/influenced by competing motivations” I didn’t mean that he was too status-oriented (I think I’m probably just as status-oriented as him), but rather that he wasn’t just playing the status game which rewards careful philosophical reasoning, but also a game that rewards being heroic and saving (or appearing/attempting to save) the world.
I’ve now read/skimmed your Replacing Fear sequence, but I’m pretty skeptical that becoming less status-oriented is both possible and a good idea. It seems like the only example you gave in the sequence is yourself, and you didn’t really talk about whether/how you became less status-oriented? (E.g., can this be observed externally?) And making a lot of people care less about status could have negative unintentional consequences, as people being concerned about status seems to be a major pillar of how human morality currently works and how our society is held together.
World worse than it could be:
social darwinism
various revolutions driven by flawed ideologies, e.g., Sun Yat-sen’s attempt to switch China from a monarchy to a democratic republic overnight with virtually no cultural/educational foundation or preparation, leading to governance failures and later communist takeover (see below for a more detailed explanation of this)
AI labs trying to save the world by racing with each other
World better than it could be:
invention/propagation of the concept of naturalistic fallacy, tempering a lot of bad moral philosophies
moral/normative uncertainty and complexity of value being fairly well known, including among AI researchers, such that we rarely see proposals to imbue AI with the one true morality nowadays
<details> The Enlightenment’s Flawed Reasoning and its Negative Consequences (written by Gemini 2.5 Pro under my direction)
While often lauded, the Enlightenment shouldn’t automatically be classified as a triumph of “careful philosophical reasoning,” particularly concerning its foundational concept of “natural rights.” The core argument against its “carefulness” rests on several points:
Philosophically “Hand-Wavy” Concept of Natural Rights: The idea that rights are “natural,” “self-evident,” or inherent in a “state of nature” lacks rigorous philosophical grounding. Attempts to justify them relied on vague appeals to God, an ill-defined “Nature,” or intuition, rather than robust, universally compelling reasoning. It avoids the hard work of justifying why certain entitlements should exist and be protected, famously leading critics like Bentham to dismiss them as “nonsense upon stilts.”
Superficial Understanding Leading to Flawed Implementation: This lack of careful philosophical grounding wasn’t just an academic issue. It fostered a potentially superficial understanding of what rights are and what is required to make them real. Instead of seeing rights as complex, practical social and political achievements that require deep institutional infrastructure (rule of law, independent courts, enforcement mechanisms) and specific cultural norms (tolerance, civic virtue, respect for process), the “natural rights” framing could suggest they merely need to be declared or recognized to exist.
Case Study: China’s Premature Turn to Democracy: The negative consequences of this superficial understanding can be illustrated by the attempt to rapidly transition China from monarchy to a democratic republic in the early 20th century.
Influenced by Enlightenment ideals, reformers and revolutionaries like Sun Yat-sen adopted the forms of Western republicanism and rights-based governance.
However, the prevailing ideology, arguably built on this less-than-careful philosophy, underestimated the immense practical difficulty and the necessary prerequisites for such a system to function, especially in China’s context.
If Chinese intellectuals and leaders had instead operated from a more careful, practical philosophical understanding – viewing rights not as “natural” but as outcomes needing to be carefully constructed and secured through institutions and cultural development – they might have pursued different strategies.
Specifically, they might have favored gradualism, supporting constitutional reforms under the weakening Qing dynasty or working with reform-minded officials and strongmen like Yuan Shikai to build the necessary political and cultural infrastructure over time. This could have involved strengthening proto-parliamentary bodies, legal systems, and civic education incrementally.
Instead, the revolutionary fervor, fueled in part by the appealing but ultimately less “careful” ideology of inherent rights and immediate republicanism, pushed for a radical break. This premature adoption of democratic forms without the functional substance contributed significantly to the collapse of central authority, the chaos of the Warlord Era, and ultimately created conditions ripe for the rise of the Communist Party, leading the country down a very different and tragic path.
In Conclusion: This perspective argues that the Enlightenment, despite its positive contributions, contained significant philosophical weaknesses, particularly in its conception of rights. This lack of “carefulness” wasn’t benign; it fostered an incomplete understanding that, when adopted by influential actors facing complex political realities like those in early 20th-century China, contributed to disastrous strategic choices and ultimately made the world worse than it might have been had a more pragmatically grounded philosophy prevailed. It underscores how the quality and depth of philosophical reasoning can have profound real-world consequences. </details>
This seems fair, and I guess from this perspective my response is that I’m not sure how to intervene on the other factors (aside from enhancing human IQ, which I do support). It seems like your view is that emotional temperament is also a good place to intervene? If so, perhaps I should read your posts with this in mind. (I previously didn’t see how the Replacing Fear sequence was relevant to my concerns, and mostly skipped it.)
I’m actually reluctant to export my ideas to more people, especially those who don’t care as much about careful reasoning (which unfortunately is almost everyone), as I don’t want to be responsible for people misusing my ideas, e.g., overconfidently putting them into practice or extending them in wrong directions.
However I’m trying to practice some skills related to exporting ideas (such as talking to people in real time and participating on X) in case it does seem to be a good idea one day. Would be interested to hear more about what other secret sauces related to this I might be missing. (I guess public speaking is another one, but the cost of practicing that one is too high for me.)
To be clear, I think this is totally fine, as long as you take care to not be or appear too confident about these half-formed ideas, and take precautions against other people taking your ideas more seriously than they should (such as by monitoring subsequent discussions and weighing in against other people’s over-enthusiasm). I think “careful thinking” can and should be a social activity, which would necessitate communicating half-formed ideas during the collaborative process. I’ve done this myself plenty of times, such as in my initial UDT post, which was very informal and failed to anticipate many subsequently discovered problems, so I’m rather surprised that you think I would be against this.
We don’t yet have collapsible sections in Markdown, but will have them in the next deploy. The syntax will be:
Have you written about this? This sounds very wrong to me.
The One True Form of Moral Progress (according to me) is using careful philosophical reasoning to figure out what our values should be, what morality consists of, where our current moral beliefs are wrong, or generally, the contents of normativity (what we should and shouldn’t do). Does this still seem wrong to you?
The basic justification for this is that for any moral “progress” or change that is not based on careful philosophical reasoning, how can we know that it’s actually a change for the better? I don’t think I’ve written a post specifically about this, but Morality is Scary is related, in that it complains that most other kinds of moral change seem to be caused by status games amplifying random aspects of human values or motivation.
Are you interested in hearing other people’s answers to these questions (if they think they have them)?
Yes. I plan to write down my views properly at some point. But roughly I subscribe to non-cognitivism.
Moral questions are not well defined because they are written in ambiguous natural language, so they are not truth apt. Now you could argue that many reasonable questions are also ambiguous in this sense. Eg the question “how many people live in Sweden” is ultimately ambiguous because it is not written in a formal system (ie. the borders of Sweden are not defined down to the atomic level).
But you could in theory define the Sweden question in formal terms. You could arbitrarily define how many nanoseconds after conception a fetus becomes a person and resolve all other ambiguities until the only work left would be empirical measurement of a well defined quantity.
And technically you could do the same for any moral question. But unlike the Sweden question, it would be hard to pick formal definitions that everyone can agree are reasonable. You could try to formally define the terms in “what should our values be?”. Then the philosophical question becomes “what is the formal definition of ‘should’?”. But this suffers the same ambiguity. So then you must define that question. And so on in an endless recursion. It seems to me that there cannot be any One True resolution to this. At some point you just have to arbitrarily pick some definitions.
The underlying philosophy here is that I think for a question to be one on which you can make progress, it must be one in which some answers can be shown to be correct and others incorrect. ie. questions where two people who disagree in good faith will reliably converge by understanding each other’s view. Questions where two aliens from different civilizations can reliably give the same answer without communicating. And the only questions like this seem to be those defined in formal systems.
Choosing definitions does not seem like such a set of questions. So resolving the ambiguities in moral questions is not something on which progress can be made. So we will never finally arrive at the One True answer to moral questions.
Ok, I see where you’re coming from, but think you’re being overconfident about non-cognitivism. My current position is that non-cognitivism is plausible, but we can’t be very sure that it is true, and making progress on this meta-ethical question also requires careful philosophical reasoning. These two posts of mine are relevant on this topic: Six Plausible Meta-Ethical Alternatives, Some Thoughts on Metaphilosophy
Reassessing heroic responsibility, in light of subsequent events.
I think @cousin_it made a good point “if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people” and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that “heroic responsibility” (or something equivalent but not causally downstream of Eliezer’s writings) is contributing to the current situation, of multiple labs racing for ASI and essentially forcing the AI transition on humanity without consent or political legitimacy, each thinking or saying that they’re justified because they’re trying to save the world. It also seemingly justifies or obligates Sam Altman to fight back when the OpenAI board tried to fire him, if he believed the board was interfering with his mission.
Perhaps “heroic responsibility” would make more sense if overcoming bias were easy, but in a world where it’s actually hard and/or few people are actually motivated to do it, which we seem to live in, spreading the idea of “heroic responsibility” seems, well, irresponsible.
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
I don’t really understand this post as doing something other than asking “on the margin are we happy or sad about present large-scale action” and then saying that the background culture should correspondingly praise or punish large-scale action. Which is maybe reasonable, but alternatively too high-level a gloss. As per the usual idea of rationality, I think whether you are capable of taking large-scale action in a healthy way is true in some worlds and not in others, and you should try to figure out which world you’re in.
The financial incentives around AI development are blatantly insanity-inducing on the topic and anyone should’ve been able to guess that going in, I don’t think this was a difficult question. Though I guess someone already exceedingly wealthy (i.e. already having $1B or $10B) could have unusually strong reason to not be concerned about that particular incentive (and I think it is the case Musk has seemed differently insane than the others taking action in this area, and lacking in some of the insanities).
However I think most moves around wielding this level of industry should be construed as building an egregore more powerful than you. The founders/CEOs of the AI big-tech companies are not able to simply turn their companies off, nor their industry. If they grow to believe their companies are bad for the world, either they’ll need to spend many years dismantling / redirecting them, or else they’ll simply quit/move on and some other person will take their place. So it’s still default-irresponsible even if you believe you can maintain personal sanity.
Overall I think taking responsibility for things is awesome and I wish people were doing more of it and trying harder. And I wish people took ultimate responsibility for as big of a thing they can muster. This is not the same as “trying to pull the biggest lever you can” or “reaching for power on a global level”, those are quite different heuristics; grabbing power can obviously just cost you sanity, and often those pulling the biggest lever they can are doing so foolishly.
As a background model, I think if someone wants to take responsibility for some part of the world going well, by default this does not look like “situating themselves in the center of legible power”. Lonely scientist/inventor James Watt spent his early years fighting poverty before successfully inventing better steam engines, and had far more influence by helping cause the industrial revolution than most anyone in government did during his era. I think confusing “moving toward legible power” for “having influence over the world” is one of the easiest kinds of insanity.
Can you expand on this? How can you tell the difference, and does it make much of a difference in the end (e.g., if most people get corrupted by power regardless of initial intentions)?
And yet, Eliezer, the writer of “heroic responsibility” is also the original proponent of “build a Friendly AI to take over the world and make it safe”. If your position is that “heroic responsibility” is itself right, but Eliezer and others just misapplied it, that seems to imply we need some kind of post-mortem on what went wrong with trying to apply the concept, and how future people can avoid making the same mistake. My guess is that like other human biases, it’s hard to avoid making this mistake even if you point it out to people or try other ways to teach people to avoid it, because the drive for status and power is deep-seated, because it has a strong evolutionary logic.
(My position is, let’s not spread ideas/approaches that will predictably be “misused”, e.g., as justification for grabbing power, similar to how we shouldn’t develop AI that will predictably be “misused”, even if nominally “aligned” in some sense.)
But I don’t believe most people get corrupted by power regardless of initial intentions? I don’t think Francis Bacon was corrupted by power, I don’t think James Watt was corrupted by power, I don’t think Stanislav Petrov was corrupted by power, and all of these people had far greater influence over the world than most people who are “corrupted by power”.
I’m hearing you’d be interested in me saying more words about the difference in what it looks like to be motivated by responsibility versus power-seeking. I’ll say some words, can see if they help.
I think someone motivated by responsibility often will end up looking more aligned with their earlier self over time even as they grow and change, will often not accept opportunities for a lot of power/prestige/money because they’re uninteresting to them, will often make sacrifices of power/prestige for ethical reasons, will pursue a problem they care about long after most would give up or think it likely to be solved.
I think someone primarily seeking power will be much more willing to do things that pollute the commons or break credit-allocation mechanisms to get credit, and generally game a lot of systems that other people are earnestly rising through. They will more readily pivot on what issue they say they care about or are working on because they’re not attached to the problem, but to the reward for solving the problem, and many rewards can be gotten from lots of different problems. They’ll be more guided by what’s fashionable right now, and more attuned to it. They’ll maneuver themselves in order to be able to politically work with whoever has power that they want, regardless of the ethics/competence/corruption of those people.
Building an AGI doesn’t seem to me like a very legible mechanism of power, or at least it didn’t in the era Eliezer pursued it (where it wasn’t also credibly “a path to making billions of dollars and getting incredible prestige”). The word ‘legible’ was doing a lot of work in the sentence I wrote.
Another framing I sometimes look through (H/T Habryka) is constrained vs unconstrained power. Having a billion dollars is unconstrained power, because you can use it to do a lot of different things – buy loads of different companies or resources. Being an engineer overlooking missile-defense systems in the USSR is very constrained, you have an extremely well-specified set of things you can control. This changes the adversarial forces on you, because in the former case a lot of people stand to gain a lot of different possible things they want if they can get leverage over you, and they have to be concerned about a lot of different ways you could be playing them. So the pressures for insanity are higher. Paths that give you the ability to influence very specific things that route through very constrained powers are less insanity-inducing, I think, and I think most routes that look like “build a novel invention in a way that isn’t getting you lots of money/status along the way” are less insanity-inducing, and I rarely find the person to have become as insane as some of the tech-company CEOs have. I also think people motivated by taking responsibility for fixing a particular problem in the world are more likely to take constrained power, because… they aren’t particularly motivated by all the other power they might be able to get.
I don’t suspect I addressed your cruxes here so far about whether this idea of heroic responsibility is/isn’t predictably misused. I’m willing to try again if you wish, or if you can try pointing again to what you’d guess I’m missing.
Well said. Bravo.
I’m also uncertain about the value of “heroic responsibility”, but this downside consideration can be mostly addressed by “don’t do things which are highly negative sum from the perspective of some notable group” (or other anti-unilateralist curse type intuitions). Perhaps this is too subtle in practice.
If humans can’t easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like “respect the decisions of boards/judges” if needed to “get the job done”, and reduces social pressure to follow such norms (by giving up the moral high ground from which one could criticize such norm violations).
You’re suggesting a very different approach, of patching heroic responsibility with anti-unilateralist curse type intuitions (on the individual level) but that’s still untried and seemingly quite risky / possibly unworkable. Until we have reason to believe that the new solution is an improvement to the existing ones, it still seems irresponsible to spread an idea that damages the existing solutions.
Hmm, I’m not sure that the idea of heroic responsibility undermines these existing mechanisms for preventing these problems, partially because I’m skeptical these existing mechanisms make much of a difference in the relevant case.
Can this be summarized as “don’t optimize for what you believe is good too hard, as you might be mistaken about what is good”?
Maybe “don’t advertise too hard that one should optimize for what they believe is good, because someone crazy will hear you and get radicalized by the message”. (such as Zizians)
Many people seem to have an instinct to translate “work hard” as “do some crazy violent action”. Just a few days ago, someone on ACX asked: “if you believe that AI is harmful, why don’t you support a terrorist group to kill the AI researchers?” For a certain mindset, this is the obvious logical response to feeling strongly about something: if you are not murdering people left and right, it means you don’t care enough about your cause.
I guess there is an evolutionary reason for this: we are running on corrupted hardware. In our evolutionary past, successfully organizing senseless violence could be an efficient way to get to the top of the tribe, so we are tempted by instinct to propose it as a solution for various problems.
The question is how to communicate the message so that it reaches people who are likely to translate “work hard” as becoming stronger, learning about how stuff works, designing a solution, and testing it, but somehow does not reach people who are likely to translate it as “hurt everyone who disagrees”.
Is this analogous to saying “capabilities research is dangerous and should not be pursued”, but for the human psyche rather than for AI?
Yeah, that seems a reasonable way to look at it. “Heroic responsibility” could be viewed as a kind of “unhobbling via prompt engineering”, perhaps.
I kind of doubt that leaders at big labs would self-identify as being motivated by anything like Eliezer’s notion of heroic responsibility. If any do self-identify that way though, they’re either doing it wrong or misunderstanding it. Eliezer has written tons of stuff about the need to respect deontology and also think about all of the actual consequences of your actions, even (especially) when the stakes are high:
(https://glowfic.com/replies/1874768#reply-1874768)
(https://glowfic.com/replies/1940939#reply-1940939)
Starting an AI lab in order to join a doomed race to superintelligence, and then engaging in a bunch of mundane squabbles for corporate control, seems like exactly the opposite of the sentiment here:
(https://hpmor.com/chapter/93)
Also, re this example:
In general, it seems perfectly fine and normal for a founder-CEO to fight back against a board ouster—no need to bring heroic responsibility into it. Of course, all parties including the CEO and the board should stick to legal / above-board / ethical means of “fighting back”, but if there’s a genuine disagreement between the board and the CEO on how to best serve shareholder interests (or humanity’s interests, for a non-profit), why wouldn’t both sides vigorously defend their own positions and power?
Perhaps the intended reading of your example is that heroic responsibility would obligate or justify underhanded tactics to win control, when the dispute has existential consequences. But I think that’s a misunderstanding of the actual concept. Ordinary self-confidence and agency obligate you to defend your own interests / beliefs / power, and heroic responsibility says that you’re obligated to win without stepping outside the bounds of deontology or slipping into invalid / motivated reasoning.
This argument seems only convincing if you don’t have those destructive values. One man’s destructive values are another’s low-hanging fruit, and those who see low-hanging fruit everywhere won’t give up on the fruit just because others may pick it.
Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either.
An analogy: one can make the same argument wrt rationality itself. It’s dual use! Someone with bad values can use rationality to do a lot of harm! Does that mean good people shouldn’t use rationality? No!
I’m not trying to “warn bad people”. I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which “heroic responsibility” actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.
Ah yes, but if all these wannabe heroes keep going we’ll be really screwed, so it’s up to me to take a stand against the fools dooming us all… the ratchet of Moloch cranks ever clockwise
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of / familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
But is this due to limitations of RLHF training, or something else?
RLHF with humans might also soon be obsoleted by things like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won’t necessarily be important: tuning of a stronger model could get bootstrapped by a weaker model (where currently SFT on an obviously off-policy instruct dataset seems to suffice), and then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (a “constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot-checking what happens as a result of choosing particular instructions.
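For readers less familiar with the setup, here is a minimal sketch of a DPO-style preference objective of the kind being referred to (a generic illustration, not any lab’s actual training recipe); the log-probabilities are assumed to be summed over response tokens, and beta is the usual scaling hyperparameter:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "reward" of each response: how much the tuned policy upweights it
    # relative to the frozen reference policy.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Increase the margin between chosen and rejected responses; note there is
    # no separately trained reward model in the RL sense.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the online variant described above, the “chosen”/“rejected” labels would come from another model judging on-policy samples against the written labeling instructions, rather than from human raters.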
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned) such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long-context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time of an experienced software engineer, which might suffice for good long-context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label; see the sketch below.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days, which isn’t too bad of a delay. (If we’re fine with these labels trickling in for online training, the delay could be even smaller.)
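A minimal sketch of that two-head construction, with hypothetical names and shapes (this is an illustration of the idea, not a published recipe):

```python
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone                       # shared LM encoder producing pooled features
        self.weak_head = nn.Linear(hidden_dim, 1)      # fit on the ~100,000s of weak labels
        self.residual_head = nn.Linear(hidden_dim, 1)  # fit on the ~1,000 high-quality labels

    def forward(self, inputs):
        h = self.backbone(inputs)                      # assumed shape: [batch, hidden_dim]
        weak_pred = self.weak_head(h)
        # Final reward = weak-label prediction + learned correction toward the strong labels.
        return weak_pred + self.residual_head(h)
```

Training would then proceed in two stages: fit weak_head on the weak labels, freeze it, and fit residual_head to predict (strong_label − weak_pred) on the small high-quality set.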
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
Oops, fixed.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Even worse, apparently the whole Superalignment team has been disbanded.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
I’m thinking that the most ethical (morally least risky) way to “insure” against a scenario in which AI takes off and property/wealth still matters is to buy long-dated far out of the money S&P 500 calls. (The longest dated and farthest out of the money seems to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
My reasoning here is that I don’t want to provide capital to AI industries or suppliers, because that seems wrong given the high x-risk I judge their activities to be causing (otherwise I’d directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my moral values. Suggestions welcome for better/alternative ways to do this.
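For concreteness, a toy version of the payoff arithmetic from the numbers quoted above ($78 premium, 10,000 strike, Dec 2029 expiry); contract multipliers, taxes, and transaction costs are ignored, and this is an illustration, not investment advice:

```python
def call_value_at_expiry(spot, strike=10_000):
    # A call option is worth max(spot - strike, 0) at expiration.
    return max(spot - strike, 0)

premium = 78  # quoted cost today
for spot in (5_000, 10_000, 15_000, 20_000):
    value = call_value_at_expiry(spot)
    print(spot, value, round(value / premium, 1))
    # e.g. SPX at 20,000 -> 10,000 payoff, roughly a 128x return;
    # anything at or below the 10,000 strike expires worthless.
```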
This probably does help capitalize AI companies a little bit, demand for call options will create demand for the underlying. This is probably a relatively small effect (?), but I’m not confident in my ability to estimate this at all.
It doesn’t differentially help capitalize them compared to everything else though, right? (Especially since some of them are private.)
Do these options have a chance to default / are the sellers stable enough?
Default seems unlikely, unless the market moves very quickly, since anyone pursuing this strategy is likely to be very small compared to the market for the S&P 500.
(Also consider that these pay out in a scenario where the world gets much richer — in contrast to e.g. Michael Burry’s “Big Short” swaps, which paid out in a scenario where the market was way down — so you’re just skimming a little off the huge profits that others are making, rather than trying to get them to pay you at the same time they’re realizing other losses.)
It seems hard to buy AI companies at the moment. The only way is to buy tech giants like Microsoft, Google, and Nvidia, which are already valued very highly—it seems like it’s somewhat priced in. It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
It would seem one would want to buy Nasdaq rather than SPX? On the other hand, maybe most tech companies will be wiped out by AI—it’s the world of atoms that would gain relative value.
I’m also uncertain, and it’s another reason for going long a broad index instead. I would go even broader than the S&P 500 if I could, but nothing else has option chains going out to 2029.
I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia (including the most influential ones so far UDT, UDASSA, b-money). Not really complaining, as they’re already more popular than I had expected (Holden Karnofsky talked extensively about UDASSA on an 80,000 Hours podcast, which surprised me), it just seems strange that the popularity stops right at academia’s door. (I think almost no philosophy professor, including ones connected with rationalists/EA, has talked positively about any of my philosophical ideas? And b-money languished for a decade gathering just a single citation in academic literature, until Satoshi reinvented the idea, but outside academia!)
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it. Hilary Greaves says the following in this 80k episode:
I also think it is unclear to what extent UDT and updateless are different from existing ideas in academia that are prima facie similar, like McClennen’s (1990) resolute choice and Meacham’s (2010, §4.2) cohesive decision theory.[1] Resolute choice in particular has been discussed in a lot of detail, and for a long time (see the citations of McClennen’s book). (And, FWIW, my sense is that most philosophers think that resolute choice is irrational and/or doesn’t make sense, at least if it is cashed out as a decision rule based on preference.)
It also doesn’t help that it is unclear what the difference between FDT and UDT is supposed to be.
(If UDT is supposed to be an LDT of some sort, then you might want to check out Spohn’s (2012)[2] version of CDT, Fisher’s (n.d) disposition-based decision theory, and Poellinger’s (2013) discussion of Spohn’s theory, for ideas in academia that are similar to the LDT-part of the theory. And then there is also Schwarz’ critique of FDT, which would then also apply to UDT, at least partially.)
My own take, using the terminology listed here, is that the causalist version of Meacham’s cohesive decision theory is basically “updateless CDT”, that the evidentialist version is basically “updateless EDT”, and that a Spohn-CDT version of cohesive decision theory is basically “U(C)DT/F(C)DT”. I also think that resolute choice is much more permissive than e.g. cohesive decision theory and updatelessness. As a decision rule, it doesn’t recommend anything close to “maximizing EU relative to your prior”. Instead, it just states that (i) what you prefer ex ante in a dynamic choice problem should be the same as what you prefer in the normalised version of the problem, and (ii) you should be dynamically consistent (i.e., the most preferred plan should not change throughout the decision problem).
Note that in the published article, it says that the article was received in 2008.
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues, despite them having existed since the early 90s[1] and appearing on the introductory SEP page for dynamic choice.
Academics may in turn ask: clearly LessWrong has some blind spots, but how big?
And it’s not like these have been forgotten; e.g., McClennen’s (1990) work still gets cited regularly.
I don’t think this is fair, because even though component ideas behind UDT/FDT have academic analogues, it doesn’t look like someone put them together into a single decision theory formulation in academic literature, at least prior to MIRI’s “Cheating Death in Damascus” being published. Also “Cheating Death in Damascus” does cite both Meacham and Spohn (and others) and it seems excusable for me to have forgotten those references since they were both published after I wrote about UDT and again were only component ideas of it, plus I haven’t actively worked on decision theory for several years.
I think Sami’s comment is entirely fair given the language and framing of the original post. It is of course fine to forget about references, but e.g. “I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia” and “Clearly academia has some blind spots, but how big?” reads like you don’t consider it a possibility that you might have re-invented something yourself, and that academics are at fault for not taking up your ideas.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented in academia and did not have much of a following there, at least as of 2019 when Cheating Death in Damascus was published, because the authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus, FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
I think there is nothing surprising about a small community of nerds writing in their spare time having blind spots, but when a large professional community has such blind spots, that’s surprising.
On your first point: as Sami writes, resolute choice is mentioned in the introductory SEP article on dynamic choice (it even has its own section!), as well as in the SEP article on decision theory. And SEP is the first place you go when you want to learn about philosophical topics and find references.
On your second point: as I wrote in my comment above, (i) academics have produced seemingly similar ideas to e.g. updatelessness (well before they were written up on LW) so it is unclear why academics should engage with less rigorous, unpublished proposals that appear to be similar (in other words, I don’t think the phrase “blind spots” is warranted), and (ii) when academics have commented on or engaged with LW DT ideas, they have to my knowledge largely been critical (e.g. see the post by Wolfgang Schwarz I linked above, as well as the quote from Greaves)[1].
Cheating Death in Damascus getting published in the Journal of Philosophy is a notable exception though!
To clarify, by “blind spot” I wasn’t complaining that academia isn’t engaging specifically with posts written up on LW, but more that nobody in academia seems to think that the combination of “updateless+logical” is clearly the most important or promising direction to explore in decision theory.
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
And the reason for the latter is the above, plus my personal distaste for writing/publishing academic papers (which I talked about elsewhere in this thread), plus FDT having been published which seems close enough to me.
Thanks for the references in the rest of your comment. I think I’ve come across Meacham 2010 and Spohn 2012 before, but forgot about them as I haven’t been working actively on decision theory for a while. It does seem that Meacham’s cohesive decision theory is equivalent to updateless EDT/CDT. (BTW in The Absent-Minded Driver I referenced a 1997 paper that also has an idea similar to updatelessness, although the authors didn’t like it.)
On a quick skim of Spohn 2012 I didn’t see something that looks like LDT or “algorithmic/logical agent ontology” but it’s quite long/dense so I’ll take your word on it for now. Still, it seems like none of the academic papers put all of the pieces together in a single decision theory proposal that’s equivalent to UDT or FDT?
(Please note that UDT as originally described was actually updateless/evidential/logical, not causalist as you wrote in the post that you linked. This has been a historical disagreement between me and Eliezer, where in I leaned towards evidential and he leans towards causal, although these days I just say that I’m confused and don’t know what to think.)
That’s fair. But what is it then that you expect academics to engage with? How would you describe this research direction, and why do you think it’s interesting and/or important?
To quickly recap the history, people on LW noticed some clear issues with “updating” and “physicalist ontology” of the most popular decision theories at the time (CDT/EDT), and thought that switching to “updatelessness” and “logical/algorithmic ontology” was an obvious improvement. (I was the first person to put the two pieces together in an explicit formulation, but they were already being talked about / hinted at in the community.) Initially people were really excited because the resulting decision theories (UDT/FDT) seemed to solve a lot of open problems in one swoop, but then pretty quickly and over time we noticed more and more problems with UDT/FDT that seem to have no clear fixes.
So we were initially excited but then increasingly puzzled/confused, and I guess I was expecting at least some academics to follow a similar path, either through engagement with LW ideas (why should they be bothered that much by lack of academic publication?), or from independent invention. Instead academia seems to still be in a state similar to LW when I posted UDT, i.e., the ideas are floating in the air separately and nobody has put them together yet? (Or I guess that was the state of academia before FDT was published in an academic journal, so now the situation is more like some outsiders put the pieces together in a formal publication, but still no academic is following a similar path as us.)
I guess it’s also possible that academia sort of foresaw or knew all the problems that we’d eventually find with UDT/FDT and that’s why they didn’t get excited in the first place. I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are? Or what are they mostly working on / arguing about these days?
There are many many interesting questions in decision theory, and “dimensions” along which decision theories can vary, not just the three usually discussed on LessWrong. It’s not clear to me why (i) philosophers should focus on the dimensions you primarily seem to be interested in, and (ii) what is so special about the particular combination you mention (is there some interesting interaction I don’t know about, maybe?). Furthermore, note that most philosophers probably do not share your intuitions: I’m pretty sure most of them would, e.g., not pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.) I don’t mean to be snarky, but you could just be wrong about what the open problems are.
I wouldn’t say so, no. But I’m not entirely sure if I understand what the open problems are. Reading your list of seven issues, I either (i) don’t understand what you are asking, (ii) disagree with the framing/think the question is misguided, or (iii) think there is an obvious answer (which makes me think that I’m missing something). With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
As I mentioned in the previous comment, it happens to solve (or at least seemed like a good step towards solving) a lot of problems I was interested in at the time.
Agreed, but my intuitions don’t seem so unpopular outside academia or so obviously wrong that there should be so few academic philosophers who do share them.
I’m not sure I would pay either. I see it more as an interesting puzzle than as something with a definitive answer. ETA: Although I’m more certain that we should build AIs that do pay. Is that also unclear to you? (If so, why might we not want to build such AIs?)
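(For readers unfamiliar with the thought experiment: in the standard version of counterfactual mugging, Omega flips a fair coin, asks you for $100 if it lands tails, and would have paid you $10,000 on heads iff you would have paid on tails; those are the conventional stakes, not numbers from this thread. The ex-ante calculation that motivates paying is

$$\mathbb{E}[\text{commit to pay}] = \tfrac{1}{2}(+\$10{,}000) + \tfrac{1}{2}(-\$100) = +\$4{,}950, \qquad \mathbb{E}[\text{refuse}] = \$0,$$

whereas an agent who updates on seeing tails sees only the $100 loss. Whether that ex-ante perspective is normative is exactly what’s in dispute here.)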
Yeah, I’m trying to keep an open mind about that. :)
Cool, I’d be interested in any further feedback when you’re ready to give them.
Yeah, that would be too hard. You might want to look at these SEP entries: Decision Theory, Normative Theories of Rational Choice: Expected Utility, Normative Theories of Rational Choice: Rivals to Expected Utility and Causal Decision Theory. To give an example of what I’m interested in, I think it is really important to take into account unawareness and awareness growth (see §5.3 of the first entry listed above) when thinking about how ordinary agents should make decisions. (Also see this post.)
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
On the point about AI (not directly responding to your question, to which I don’t have an answer): I think it’s really important to be clear about whether we are discussing normative, constructive or descriptive decision theory (using Elliott Thornley’s distinction here). For example, the answers to “is updatelessness normatively compelling?”, “should we build an updateless AI?” and “will some agents (e.g. advanced AIs) commit to being updateless?” will most likely come apart (it seems to me). And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Thanks, will look into your references.
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
Yes, missed or forgot about that sentence, sorry.
Thanks.
I was mostly thinking about discussions surrounding what the “correct” decision theory is, whether you should pay in CM, and so on.
Thanks Sylvester! Yep it looks like cohesive decision theory is basically original UDT. Do you know what the state of the art is in terms of philosophical critiques of cohesive decision theory? Any good ones? Any good responses to the critiques?
Cohesive decision theory lacks the logical/algorithmic ontology of UDT and is closer to what we call “updateless EDT/CDT” (the paper itself talks about cohesive versions of both).
Also interested in a response from Sylvester, but I would guess that one of the main critiques is something like Will MacAskill’s Bomb thought experiment, or just intuitions for paying the counterfactual mugger. From my perspective, these do have a point when it comes to humans, since humans seemingly have indexical values, and one way to explain why UDT makes recommendations in these thought experiments that look “bizarre” to many humans is that it assumes away indexical values (via the type signature of its utility function). (It was an implicit and not totally intentional assumption, but it’s unclear how to remove the assumption while retaining nice properties associated with updatelessness.) I’m unsure if indexical values themselves are normative or philosophically justified, and they are probably irrelevant or undesirable when it comes to AIs, but I guess academic philosophers probably take them more for granted and are not as interested in AI (and therefore take a dimmer view on updatelessness/cohesiveness).
But yeah, if there are good critiques/responses aside from these, it would be interesting to learn them.
I don’t think cohesive decision theory is being discussed much, but I’m not sure. Perhaps because the theory is mainly used to argue against the claim that “every decision rule will lead agents who can’t bind themselves to disaster” (p. 20, footnote 34) in the paper, and discussion of its independent interest is relegated to a footnote (footnote 34).
OK, thanks. So then the mystery remains why academic philosophy isn’t more interested in this.
I was thinking of writing a short post kinda on this topic (EDIT TO ADD: it’s up! See Some (problematic) aesthetics of what constitutes good work in academia), weaving together:
Holden on academia not answering important questions
This tweet I wrote on the aesthetics of what makes a “good” peer-reviewed psychology paper
Something about the aesthetics of what makes a “good” peer-reviewed AI/ML paper, probably including the anecdote where DeepMind wrote a whole proper academia-friendly ML paper whose upshot was the same as a couple sentences in an old Abram Demski blog post
Something about the aesthetics of what makes a “good” peer-reviewed physics paper, based on my personal experience, probably including my anecdote about solar cell R&D from here
Not academia but bhauth on the aesthetics of what makes a “good” VC pitch
maybe a couple other things (suggestions anyone?)
Homework problem for the reader: what are your “aesthetics of success”, and how are they screwing you over?
I followed this link thinking that it looks relevant to my question, but the way Holden delineates what academia is interested in, it should totally be interested in my ideas:
versus what Holden says are important questions that academia neglects:
The rest of your comment seems to be hinting that maybe academia is ignoring my ideas because it doesn’t like the aesthetics of my writing? (Not sure if that was your point, or if those bullet points weren’t supposed to be directly related to my question...) Even if that’s true though, I’m still puzzled why academia hasn’t reinvented any of my ideas (which have been independently invented multiple times outside of academia, e.g. Nick Szabo and Satoshi with b-money, Paul Christiano with UDASSA).
Hmm, yeah I guess what I wrote wasn’t too directly helpful for your question.
I think Holden forgot “trendy”. Trendy is very important. I think people in academia have a tacit shared understanding of the currently-trending topics / questions, within which there’s a contest to find interesting new ideas / progress. If an idea is important but not trendy, it’s liable to get neglected, I think. It’s kinda like in clothing fashion: if you find a brilliant use of beads, but beads aren’t fashion-forward this year, roughly nobody will care.
Of course, the trends change, and indeed everyone is trying to be the pioneer of the next hot topic. There are a lot of factors that go into “what is the next hot topic”, including catching the interest of a critical mass of respected people (or people-who-control-funding), which in turn involves them feeling it’s “exciting”, and that they themselves have an angle for making further progress in this area, etc. But trendiness doesn’t systematically track objective importance, and it’s nobody’s job to make it so.
At least, that’s what things felt like to me in the areas of physics I worked in (optics, materials science, and related). I’m much less familiar with philosophy, economics, etc.
Remember, aside from commercially-relevant ideas, success for academia research scientists (and philosophers) is 100% determined by “am I impressing my peers?”—grants, promotions, invited talks, etc. are all determined by that. So if I write a paper and the prestigious people in my field are unanimously saying “I don’t know about that thing, it’s not an area that I know or care about”, the result is just as bad for me and my career as if those people had unanimously said “this is lousy work”.
To be clear, when I said “the aesthetic of what makes a good X”, I meant it in a really broad sense. Maybe I should have said “the implicit criteria of what makes a good X” instead. So “the paper concerns a currently-trendy topic” can be part of that, even though it’s not really “aesthetics” in the sense of beauty. E.g., “the aesthetic of what makes a good peer-reviewed experimental condensed-matter physics paper” has sometimes been greatly helped by “it somehow involves nanotechnology”.
From my years in academia studying neuroscience and related aspects of bioengineering and medicine development… yeah. So much about how effort gets allocated is not ‘what would be good for our country’s population in expectation, or good for all humanity’. It’s mostly about ‘what would make an impressive-sounding research paper that could get into an esteemed journal?’, ‘what would be relatively cheap and easy to do, but sound disproportionately cool?’, ‘what do we guess that the granting agency we are applying to will like the sound of?’. So much emphasis on catching waves of trendiness, and so little on estimating the expected value of the results.
Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.
Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.
I think that UDASSA and UDT might be in academia’s blind spots in the same way that the Everett interpretation is: more correct theories that came after less correct theories with mostly only theoretical evidence to support changing over to the new theories.
Aside from the literature on international relations, I don’t know much about academic dysfunction (what I do know comes mostly from reading parts of Inadequate Equilibria, particularly the visitor dialog), and other LessWrong people can probably cover it better. I think that planecrash, Yud’s second HPMOR-scale work, mentions that everyone in academia just generally avoids citing things published outside of academia, because they risk losing status if they do.
EDIT: I went and found that section, it is here:
It’s glowfic, so of course I don’t know how accurate it is, as it’s intended to be plausibly deniable enough to facilitate free writing (while keeping things entertaining enough to register as not-being-work).
I have to think more about the status dynamics that Eliezer talked about. There’s probably something to it… But this part stands out as wrong or at least needing nuance/explanation:
I think most academic venues do blind reviews and whoever decides whether or not to accept a paper isn’t supposed to know who wrote it? Which isn’t to say that the info won’t leak out anyway and influence the decision. (For example I once left out the acknowledgements section in a paper submission, thinking that, like the author byline, I was supposed to add it after the paper was accepted, but apparently I was actually supposed to include it and someone got really peeved that I didn’t.)
Also it seems weird that Eliezer wrote this in 2021, after this happened in 2019:
From talking with people who work on a lot of grant committees in the NIH and similar funding orgs, it’s really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea… So if you are familiar with the general approach of a particular lab and its primary investigator, you will immediately recognize and have a knee-jerk reaction (positive or negative) to a paper which pattern-matches to the work that that lab / subfield is doing.
Common reactions from grant reviewers:
Positive—“This fits in nicely with my friend Bob’s work. I respect his work, I should argue for funding this grant.”
Neutral—“This seems entirely novel to me, I don’t recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn’t argue too hard for it.”
Slightly negative—“This seems novel to me, and doesn’t sound particularly ‘jargon-y’ or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this.”
Negative—“This seems to pattern match to a subfield I feel biased against. Even if this isn’t from one of Jill’s students, it fits with Jill’s take on this subtopic. I don’t want views like Jill’s gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal.”
Ah, sorry that this wasn’t very helpful.
I will self-downvote so this isn’t the top comment. Yud’s stuff is neat, but I haven’t read much on the topic, and passing some along when it comes up has been a good general heuristic.
No need to be sorry, it’s actually great food for thought and I’m glad you pointed me to it.
Many parts of academia have a strong Not Invented Here tendency. Not only is research outside of academia usually ignored, but so is research outside a specific academic citation bubble, even if another bubble investigates a pretty similar issue. For example, economic decision theorists ignore philosophical decision theorists, who in turn mostly ignore the economic decision theorists. Each bubble has its own writing style, concerns, and canonical examples or texts, which makes it hard for outsiders to read the literature or even contribute to it, so they don’t.
A striking example is statistics, where various fields talk about the same mathematical thing with their own idiosyncratic names, unaware or unconcerned whether it already had a different name elsewhere.
Edit: Though LessWrong is also a citation bubble to some degree.
“Read the sequences....just the sequences”
Something a better, future version of rationalism could do is build bridges and facilitate communication between these little bubbles. The answer-to-everything approach has been tried too many times.
Have you tried talking to professors about these ideas?
Indeed, there is no need for sorrow, for by choosing to remain anonymous, you have done great things. The world owes you a Nobel Prize in Economics and a Turing Award. It is time for the world to seriously recognize your achievements and lead it towards a financial system without bubbles.
Why haven’t you written academic articles on these topics?
The secret is that an academic article is just a formatting convention, and anyone can submit to scientific journals. There is no need to have a PhD or even to work at a scientific institution.
I wrote an academic-style paper once, as part of my job as an intern in a corporate research department. It soured me on the whole endeavor, as I really didn’t enjoy the process (writing in the academic style, the submission process, someone insisting that I retract the submission to give them more credit despite my promise to insert the credit before publication), and then it was rejected with two anonymous comments indicating that both reviewers seemed to have totally failed to understand the paper and giving me no chance to try to communicate with them to understand what caused the difficulty. The cherry on top was my mentor/boss indicating that this is totally normal, and I was supposed to just ignore the comments and keep resubmitting the paper to other venues until I run out of venues.
My internship ended around that point and I decided to just post my ideas to mailing lists / discussion forums / my home page in the future.
Also, I think MIRI got FDT published in some academic philosophy journal, and AFAIK nothing came of it?
The FDT paper got 29 citations, but many are from MIRI-affiliated people and/or about AI safety. https://scholar.google.ru/scholar?cites=13330960403294254854&as_sdt=2005&sciodt=0,5&hl=ru
One can escape trouble with reviewers by publishing on arXiv or in other paper archives (PhilPapers). Google Scholar treats these as normal articles.
But there are in fact good journals with genuinely helpful reviewers (e.g. Futures).
I don’t think FDT got published—as far as I can tell it’s just on arXiv.
I was referring to Cheating Death In Damascus which talks about FDT in Section 4.
There is some similarity between UDASSA and “Law without Law” by Mueller, as both use Kolmogorov complexity to predict the distribution of observers. In LwL there isn’t any underlying reality except numbers, so it is just dust theory over random number fields.
From someone who left a corner of it: the blindspots could be arbitrarily large as far as I know, because there seemed to me to be no real explicit culture of Hamming questions/metalooking for anything neglected. You worked on something vaguely similar/related to your advisor’s work, because otherwise you can’t get connections to people who know how to attack the problem.
Maybe Chinese civilization was (unintentionally) on the right path: discourage or at least don’t encourage technological innovation but don’t stop it completely, run a de facto eugenics program (Keju, or Imperial Examination System) to slowly improve human intelligence, and centralize control over governance and culture to prevent drift from these policies. If the West hadn’t jumped the gun with its Industrial Revolution, by the time China got to AI, human intelligence would be a lot higher and we might be in a much better position to solve alignment.
This was inspired by @dsj’s complaint about centralization, using the example of it being impossible for a centralized power or authority to deal with the Industrial Revolution in a positive way. The contrarian in my mind piped up with “Maybe the problem isn’t with centralization, but with the Industrial Revolution!” If the world had more centralization, such that the Industrial Revolution never started in an uncontrolled way, perhaps it would have been better off in the long run.
One unknown is what would the trajectory of philosophical progress look like in this centralized world, compared to a more decentralized world like ours. The West seems to have better philosophy than China, but it’s not universal (e.g. analytical vs Continental philosophy). (Actually “not universal” is a big understatement given how little attention most people pay to good philosophy, aside from a few exceptional bubbles like LW.) Presumably in the centralized world there is a strong incentive to stifle philosophical progress (similar to China historically), for the sake of stability, but what happens when average human IQ reaches 150 or 200?
What motive does a centralized dominant power have to allow any progress? The entire world would likely look more like North Korea.
A culture/ideology that says the ruler is supposed to be benevolent and try to improve their subjects’ lives, which of course was not literally followed, but would make it hard to fully suppress things that could clearly make people’s lives better, like many kinds of technological progress. And historically, AFAIK few if any of the Chinese emperors tried to directly suppress technological innovation, they just didn’t encourage it like the West did, through things like patent laws and scientific institutions.
Yes, directionally it would look more like North Korea, but I think the controls would not have to be as total or harsh, because there is less of a threat that outside ideas could rush in and overturn the existing culture/ideology the moment you let your guard down.
I don’t think patent laws are the key difference; other property rights are more important.
Gutenberg was living in a free city and was able to take a loan to start a startup with 15-25 employees on the promise of an unproven technology, without any patent laws protecting him. Merchants were strong enough at the time that the idea of creating a startup with the main motivation of financial return was viable.
Chinese rulers could just take whatever they wanted from merchants, so merchants were less powerful and did not make similar capital investments. When Europeans sailed across the oceans, they did it because of the ability to make a profit from trade. When the Chinese did it, they wanted to bring home gifts to the emperor.
The key difference is that European merchants could make very expensive capital investments that then produced returns and reinvest those returns to produce even more value.
Which period of “chinese civilisation” are you referring to? I think it would be hard to point to any isolated “chinese civilisation” just minding its own business and keeping a firm grip on a unified cultural and ethnic population. Over 3500+ years of written history the territory occupied by China today had multiple periods of unity and division, sometimes splitting up into 10 or more states, often with multiple empires and dynasties coexisting in various levels of war and peace and very loosely ruled areas in between. (This is IMO a central theme of Chinese history: the first line of the Romance of the Three Kingdoms reads “Of matters under heaven, we can say that what is long united must divide, what is long divided must unite”. At various points the “Chinese Empire” looked more like the Holy Roman Empire, e.g. during the late Zhou dynasty leading into the Spring and Autumn period)
The “chinese lands” were taken over by the Mongols and the Manchu during the Yuan and Qing dynasties (the latter one being the last dynasty before the 20th century), and at various points the borders of the Chinese empire would grow and shrink to encompass what we today recognise as Korea, Japan, South East Asia, Tibet… There are 56 recognised ethnic groups in China today. The importance and purpose of the Keju system also changed throughout the periods it was in use, and I have no idea where you got the eugenics thing from. I also think you would have a hard time building a case for any intentional or centralised control of scientific research beyond that of the European states at the time, mostly because the idea of scientific research is itself a very modern one (is alchemical research science?). As far as I can understand it you’re taking the “vibe” of a strong, unified, centralised state that people recognise today in the People’s Republic of China and then stretching it backwards to create some kind of artificial historical throughline.
(The following is written by AI (Gemini 2.5 Pro) but I think it correctly captured my position.)
You’re right to point out that I’m using a highly stylized and simplified model of “Chinese civilization.” The reality, with its dynastic cycles, periods of division, and foreign rule, was far messier and more brutal than my short comment could convey.
My point, however, isn’t about a specific, unbroken political entity. It’s about a civilizational attractor state. The remarkable thing about the system described in “Romance of the Three Kingdoms” is not that it fell apart, but that it repeatedly put itself back together into a centralized, bureaucratic, agrarian empire, whereas post-Roman Europe fragmented permanently. Even foreign conquerors like the Manchus were largely assimilated by this system, adopting its institutions and governing philosophy (the “sinicization” thesis).
Regarding the Keju, the argument isn’t for intentional eugenics, but a de facto one. The mechanism is simple: if (1) success in the exams correlates with heritable intelligence, and (2) success confers immense wealth and reproductive opportunity (e.g., supporting multiple wives and children who survive to adulthood), then over a millennium you have created a powerful, systematic selective pressure for those traits.
The core of the thought experiment remains: is a civilization that structurally, even if unintentionally, prioritizes stability and slow biological enhancement over rapid, disruptive technological innovation better positioned to handle long-term existential risks?
Say more about the de-facto eugenics program?
It’s based on the idea that Keju created a long-term selective pressure for intelligence.
The exams selected for heritable cognitive traits.
Success led to positions in the imperial government, and therefore power and wealth.
Power and wealth allowed for more wives, concubines, food, resources, and many more surviving children than the average person, which was something many Chinese consciously aimed for. (Note that this is very different from today’s China or the West, where cultural drift/evolution has much reduced or completely eliminated people’s desires to translate wealth into more offspring.)
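As a toy illustration of how weak but persistent selection can compound (using the standard breeder’s equation R = h² · S; the numbers below are made-up placeholders, not historical estimates):

```python
# Toy model only: all parameter values are illustrative assumptions.
heritability = 0.5        # assumed narrow-sense heritability of the selected trait
selection_diff = 0.5      # assumed population-wide selection differential, in IQ points per generation
years_per_generation = 25
generations = 1000 // years_per_generation   # ~40 generations in a millennium

gain_per_generation = heritability * selection_diff   # breeder's equation: R = h^2 * S
print(gain_per_generation * generations)              # ~10 points over a millennium, under these assumptions
```

Whether the real selection differential was anywhere near that large is of course the contested empirical question.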
If we made “spend money on kids” cool again, do you think we automatically get selection-for-intelligence for free, or is there another missing bit?
If people started trying earnestly to convert wealth/income into more kids, we’d come under Malthusian constraints again, and before that there would be much backsliding in living standards and downward social mobility for most people, which would trigger a lot of cultural upheaval and potential backlash (e.g., calls for more welfare/redistribution and attempts to turn culture back against “eugenics”/”social Darwinism”, which would probably succeed just like they succeeded before). It seems ethically pretty fraught to try to push the world in that direction, to say the least, and it has a lot of other downsides, so I think at this point a much better plan for increasing human intelligence is to make genetic enhancements available that parents can voluntarily choose for their kids, government-subsidized if necessary to make them affordable for everyone, which avoids most of these problems.
There would likely be some selection process, but that would be very slow compared to all the other factors at play.
But would they be at all threatened by analytic philosophy, brewing patiently in its corner as an amusement of sensible people?
The Inhumanity of AI Safety
A: Hey, I just learned about this idea of artificial superintelligence. With it, we can achieve incredible material abundance with no further human effort!
B: Thanks for telling me! After a long slog and incredible effort, I’m now a published AI researcher!
A: No wait! Don’t work on AI capabilities, that’s actually negative EV!
B: What?! Ok, fine, at huge personal cost, I’ve switched to AI safety.
A: No! The problem you chose is too legible!
B: WTF! Alright you win, I’ll give up my sunken costs yet again, and pick something illegible. Happy now?
A: No wait, stop! Someone just succeeded in making that problem legible!
B: !!!
This observation should make us notice confusion about whether AI safety recruiting pipelines are actually doing the right type of thing.
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
For example, an “AI capabilities” researcher from a decade ago who cared much more about fundamental knowledge than about citations might well have invented mechanistic interpretability without any thought of safety or alignment. Similarly, an AI capabilities researcher at OpenAI who was sufficiently high-integrity might have whistleblown on the non-disparagement agreements even if they didn’t have any “safety-aligned” motivations.
Also, AI safety researchers who have those traits won’t have an attitude of “What?! Ok, fine” or “WTF! Alright you win” towards people who convince them that they’re failing to achieve their goals, but rather an attitude more like “thanks for helping me”. (To be clear, I’m not encouraging people to directly try to adopt a “thanks for helping me” mentality, since that’s liable to create suppressed resentment, but it’s still a pointer to a kind of mentality that’s possible for people with sufficiently little internal conflict.) And in the ideal case, they will notice that there’s something broken about their process for choosing what to work on, and rethink that in a more fundamental way (which may well lead them to conclusions similar to mine above).
I’m not sure I actually agree with this. Can you explain how someone who is virtuous, but missing the crucial consideration of “legible vs. illegible AI safety problems” can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?
My current (uncertain) perspective is that we actually do still need people to be “acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)” but the AI safety community needs to get better at being strategic somehow. Otherwise I don’t see how each person can discover all of the necessary crucial considerations on their own, or even necessarily appreciate all the important considerations that the community has come up with. And I do not see why “people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.”
(Or alternatively put all/most effort into AI pause/stop/slowdown, which perhaps does not require as much strategic finesse.)
If a person is courageous enough to actually try to solve a problem (like AI safety), and high-integrity enough to avoid distorting their research due to social incentives (like incentives towards getting more citations), and honest enough to avoid self-deception about how to interpret their research, then I expect that they will tend towards doing “illegible” research even if they’re not explicitly aware of the legible/illegible distinction. One basic mechanism is that they start pursuing lines of thinking that don’t immediately make much sense to other people, and the more cutting-edge research they do the more their ontology will diverge from the mainstream ontology.
This has pretty low argumentative/persuasive force in my mind.
Why? I’m not seeing the logic of how your premises lead to this conclusion.
And even if there is this tendency, what if someone isn’t smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?
And even for the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to them and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?
Note that my comment was not optimized for argumentative force about the overarching point. Rather, you asked how they “can” still benefit the world, so I was trying to give a central example.
In the second half of this comment I’ll give a couple more central examples of how virtues can allow people to avoid the traps you named. You shouldn’t consider these to be optimized for argumentative force either, because they’ll seem ad-hoc to you. However, they might still be useful as datapoints.
Figuring out how to describe the underlying phenomenon I’m pointing at in a compelling, non-ad-hoc way is one of my main research focuses. The best I can do right now is to say that many of the ways in which people produce outcomes which are harmful (by their own lights) seem to arise from a handful of underlying dynamics. I call this phenomenon pessimization. One way in which I’m currently thinking about virtues is as a set of cognitive tools for preventing pessimization. As one example, kindness and forgiveness help to prevent cycles of escalating conflict with others, which is a major mechanism by which people’s values get pessimized. This one is pretty obvious to most people; let me sketch out some less obvious mechanisms below.
This actually happened to me: when I graduated from my master’s I wasn’t cognitively capable of coming up with new lines of illegible alignment research, in part because I was too status-seeking. Instead I went to work at DeepMind, and ended up spending a lot of my time working on RLHF, which is a pretty central example of a “legible” line of research.
However, I also wasn’t cognitively capable of making much progress on RLHF, because I couldn’t see how it addressed the core alignment problem, and so it didn’t seem fundamental enough to maintain my interest. Instead I spent most of my time trying to understand the alignment problem philosophically (resulting in this sequence) at the expense of my promotion prospects.
In this case I think I had the virtue of deep curiosity, which steered my attention towards illegible problems even though my top-down plan was to contribute to alignment by doing RLHF research. These days, whatever you might think of my research, few people complain that it’s too legible.
There are other possible versions of me who had that deep curiosity but weren’t smart enough to have generated a research agenda like my current one; however, I think they would still have left DeepMind, or at least not been very productive on RLHF.
When a field becomes crowded, there’s a pretty obvious inference that you can make more progress by moving to a less crowded field. I think people often don’t draw that inference because moving to a less crowded field loses them prestige, is emotionally/financially risky, etc. Virtues help remove those blockers.
Sorry, you might be taking my dialog too seriously, unless you’ve made such observations yourself, which of course is quite possible since you used to work at OpenAI. I’m personally far from the places where such dialogs might be occurring, so don’t have any observations of them myself. It was completely imagined in my head, as a dark comedy about how counter to human (or most human’s) nature strategic thinking/action about AI safety is, and partly a bid for sympathy for the people caught in the whiplashes, to whom this kind of thinking or intuition doesn’t come naturally.
Edit: To clarify a bit more, B’s reactions like “WTF!” were written more for comedic effect, rather than trying to be realistic or based on my best understanding/predictions of how a typical AI researcher would actually react. It might still be capturing some truth, but again just want to make sure people aren’t taking my dialog more seriously than I intend.
I’m taking the dialogue seriously but not literally. I don’t think the actual phrases are anywhere near realistic. But the emotional tenor you capture of people doing safety-related work that they were told was very important, then feeling frustrated by arguments that it might actually be bad, seems pretty real. Mostly I think people in B’s position stop dialoguing with people in A’s position, though, because it’s hard for them to continue while B resents A (especially because A often resents B too).
Some examples that feel like B-A pairs to me include: people interested in “ML safety” vs people interested in agent foundations (especially back around 2018-2022); people who support Anthropic vs people who don’t; OpenPhil vs Habryka; and “mainstream” rationalists vs Vassar, Taylor, etc.
What is the “great personal cost” to shifting from AI capabilities to safety? Sure, quitting one’s frontier lab job to become an independent researcher means taking a pay cut, but that’s an opportunity cost and not really an enormous sacrifice. It’s not like any frontier labs would try and claw back your equity … again.
I’ve seen somewhere that (some) people at AI labs are thinking in terms of shares of the future lightcone, not just money.
If most of your friends are capabilities researchers who aren’t convinced that their work is negative EV yet, it might be pretty awkward when they ask why you’ve switched to safety.
There’s a big prestige drop (in many people’s minds, such as one’s parents’) from being at a place like OpenAI (perceived by many as a group made up of the best of the best) to being an independent researcher. (“What kind of a job is that?!”)
Having to let go of sunk costs (knowledge/skills for capabilities research) and invest in a bunch of new human capital needed for safety research.
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT didn’t know that Constitutional AI is RLAIF (until I reminded it), and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke, as both models made the same error 2 out of 3 times.)
Isn’t the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
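For concreteness, the loop being described is roughly the following sketch (assuming a hypothetical model.generate(prompt) -> str interface; this is illustrative, not the paper’s exact setup):

```python
# Minimal sketch of a critique-and-revise loop (hypothetical model.generate API).
# The model drafts an answer, critiques the draft against a written principle,
# then rewrites it; the revision can be kept as supervised finetuning data,
# with no human feedback in the loop.

def critique_and_revise(model, prompt, principle, rounds=1):
    draft = model.generate(prompt)
    for _ in range(rounds):
        critique = model.generate(
            f"Principle: {principle}\n\nResponse: {draft}\n\n"
            "Point out any ways the response falls short of the principle."
        )
        draft = model.generate(
            f"Response: {draft}\n\nCritique: {critique}\n\n"
            "Rewrite the response to address the critique."
        )
    return draft  # kept as a finetuning target
```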
These posts might be relevant:
A recipe for frontier model post-training
Futures of the data foundry business model
The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the “constitution”. In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
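To make that concrete, here is a rough sketch of the RLAIF piece, assuming a hypothetical model.generate(prompt) -> str interface (illustrative only, not any lab’s actual recipe): the model itself, steered by the constitution, plays the role of the human preference labeler, and the resulting triples can feed a reward model for RLHF-style training or be used directly for DPO.

```python
# Rough sketch of AI-generated (RLAIF-style) preference data, assuming a
# hypothetical model.generate(prompt) -> str interface. Illustrative only.

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(model, prompt, response_a, response_b):
    """Have the model itself judge which response better fits the constitution."""
    judge_prompt = (
        f"{CONSTITUTION}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better? Answer with 'A' or 'B'."
    )
    verdict = model.generate(judge_prompt)
    if verdict.strip().upper().startswith("A"):
        return response_a, response_b  # (chosen, rejected)
    return response_b, response_a

def build_synthetic_preference_data(model, prompts):
    """Produce (prompt, chosen, rejected) triples with no human labelers."""
    data = []
    for prompt in prompts:
        a = model.generate(prompt)
        b = model.generate(prompt)  # second sample to compare against
        chosen, rejected = ai_preference_label(model, prompt, a, b)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data  # feed a reward model (RLHF) or use directly for DPO
```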
The Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And the data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it’s unclear whether future models won’t soon do better than humans at labeling, or whether some leading labs’ models possibly already do. Meta didn’t have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models, and o1-like reasoning models.
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here’s a poem(?) I wrote for Twitter, actually before coming across the above story:
yyyep
Maybe others are using it in secret but don’t want to admit it for some reason? I can’t find any mention of Anthropic having filed a patent on the idea, but maybe other companies worry they’d look like second-rate imitators if they said they were copying Anthropic’s idea?
Just speculating, I don’t know. Sure seems like a useful idea to copy.
AI companies don’t seem to be shy about copying RLHF though. Llama, Gemini, and Grok are all explicitly labeled as using RLHF.
I’m increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I’ve previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
Low expressed interest in metaphilosophy (in relation to either AI or humans)
Low expressed interest in AI philosophical competence (either concern that it might be low, or desire/excitement for supercompetent AI philosophers with Jupiter-sized brains)
Low concern that philosophical difficulty will be a blocker of AI alignment or cause of AI risk
High confidence when proposing novel solutions (even to controversial age-old questions, and when the proposed solution fails to convince many)
Rarely attacking one’s own ideas (in a serious or sustained way) or changing one’s mind based on others’ arguments
Rarely arguing for uncertainty/confusion (i.e., that that’s the appropriate epistemic status on a topic), with normative ethics being a sometime exception
Possible explanations:
General human overconfidence
People who have a high estimate of difficulty of philosophy self-selecting out of the profession.
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
“Signal group membership” may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
Here are some axes along which I think there’s some group membership signaling in philosophy (IDK about the extent and it’s hard to disentangle it from other stuff):
Math: platonism/intuitionism/computationalism (i.e. what is math?), interpretations of probability, foundations of math (set theory vs univalent foundations)
Mind: externalism/internalism (about whatever), consciousness (de-facto-dualisms (e.g. Chalmers) vs reductive realism vs illusionism), language of thought vs 4E cognition, determinism vs compatibilism vs voluntarism
Metaphysics/ontology: are chairs, minds, and galaxies real? (this is somewhat value-laden for many people)
Biology: gene’s-eye-view/modern synthesis vs extended evolutionary synthesis
I don’t think this is accurate; I think most philosophy is done under motivated reasoning, but is not straightforwardly about signaling group membership.
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident, but that is more stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare https://link.springer.com/article/10.1007/s43681-023-00379-1.
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they became convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning or better intellectual tools in general? (Or what’s the closest to this that you’re aware of?)
Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know 1 or 2 private expressions of the former, but not translated into action yet.)
Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
How often have you seen philosophers say something like “Upon further reflection, my proposed solution to problem X has many problems/issues, I’m no longer confident it’s the right approach and now think X is much harder than I originally thought.”
Would also appreciate any links/citations on these, or quotes from personal but sharable communications.
These are all things I’ve said or done due to my high estimate of philosophical difficulty, but not (or rarely) seen among academic philosophers, at least from my casual observation from outside academia. It’s also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that for example you don’t think philosophers should often say or do these things), which would also be interesting to know.
Another academic philosopher, directed here by @Simon Goldstein. Hello Wei!
It’s not common to switch entirely to metaphilosophy, but I think lots of us get more interested in the foundations and methodology of at least our chosen subfields as we gain experience, see where progress is(n’t) being made, start noticing deep disagreements about the quality of different kinds of work, and so on. It seems fair to describe this as awakening to a need for better tools and a greater understanding of methods. I recently wrote a paper about the methodology of one of my research areas, philosophy of mathematical practice, for pretty much these reasons.
Current LLMs are pretty awful at discussing the recent philosophy literature, so I think anyone who’d like AI tools to serve as useful research assistants would be happy to see at least some improvement here! I’m personally also excited about the prospects of using language models with bigger context windows for better corpus analysis work in empirical and practice-oriented parts of philosophy.
I basically agree with Simon on this.
I don’t think this is uncommon. You might not see these reversals in print often, because nobody wants to publish and few people want to read a paper that just says “I retract my previous claims and no longer have a confident positive view to offer”. But my sense is that philosophers often give up on projects because the problems are piling up and they no longer see an appealing way forward. Sometimes this happens more publicly. Hilary Putnam, one of the most influential philosophers of the later 20th century, was famous for changing his mind about scientific realism and other basic metaphysical issues. Wesley Salmon gave up his influential “mark transmission” account of causal explanation due to counterexamples raised by Kitcher (as you can read here). It would be easy enough to find more examples.
Great questions. Sadly, I don’t have any really good answers for you.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics out of frustration with trying to find answers to questions in normative ethics.
I do not, except for the end of Superintelligence
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.
The FLI Pause letter didn’t achieve a pause, but it dramatically shifted the Overton Window.
Thanks, it’s actually very interesting and important information.
I’ve noticed (and stated in the OP) that normative ethics seems to be an exception where it’s common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because e.g. we’ve been trying to solve them for centuries without coming up with broadly convincing solutions), and there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn’t seem to be the case.
What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring out which, because if “calling for an AI pause”, what is the opportunity cost (it seems easy enough to write or sign an open letter), and if “the AI pause itself”, “seems useless” contradicts “would love”. In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views on this on LW (or their preferred venue), and share your own views?
FTR I’d probably be up for helping out logistically with such an open letter (e.g. making the website and any other parts of it). I previously made this open letter.
I think there’s a steady stream of philosophers getting interested in various questions in metaphilosophy; metaethics is just the most salient to me. One example is the recent trend towards conceptual engineering (https://philpapers.org/browse/conceptual-engineering). Metametaphysics has also gotten a lot of attention in the last 10-20 years https://www.oxfordbibliographies.com/display/document/obo-9780195396577/obo-9780195396577-0217.xml. There is also some recent work in metaepistemology, but maybe less so because the debates tend to recapitulate previous work in metaethics https://plato.stanford.edu/entries/metaepistemology/.
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.
Thanks for this info and the references. I guess by “metaphilosophy” I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there’s a preferred term for the concept I’m pointing at.) My reasoning being that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?
I think you (and other philosophers) may be too certain that a pause won’t happen, but I’m not sure I can convince you (at least not easily). What about calling for it in a low cost way, e.g., instead of doing something high profile like an open letter (with perceived high opportunity costs), just write a blog post or even a tweet saying that you wish for an AI pause, because …? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if by keeping silent, you’re helping to keep society in a highly suboptimal equilibrium?
I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the “opportunity cost” issue, but I’m not sure if you’re open to that kind of argument. I think they should have some weight given moral uncertainty.
Hm. I think modern academic philosophy is a raging shitshow, but I thought philosophy on LW was quite good. I wasn’t a regular LW user until a couple of years ago, and the philosophical takes here, particularly Eliezer’s, converge with my own conclusions after a half lifetime of looking at philosophical questions through the lens of science, particularly neuroscience and psychology.
So: what do you see as the limitations in LW/Yudkowskian philosophy? Perhaps I’ve overlooked them.
I am currently skeptical that we need better philosophy for good AGI outcomes, vs. better practical work on technical AGI alignment (a category that barely exists) and PR work to put the likely personal intent aligned AGI into the hands of people that give half a crap about understanding or implementing ethics. Deciding on the long term future will be a matter of a long contemplation if we get AGI into good hands. We should decide if that logic is right, and if so, plan the victory party after we’ve won the war.
I did read your metaphilosophy post and remain unconvinced that there’s something big the rest of us are missing.
I’m happy to be corrected (I love becoming less wrong, and I’m aware of many of my biases that might prevent it):
Here’s how it currently looks to me: Ethics are ultimately a matter of preference, the rest is game theory and science (including the science of human preferences). Philosophical questions boil down to scientific questions in most cases, so epistemology is metaphilosophy for the most part.
Change my mind! Seriously, I’ll listen. It’s been years since I’ve thought about philosophy hard.
I was just reading Daniel Dennett’s memoir for no reason in particular, it had some interesting glimpses into how professional philosophers actually practice philosophy. Like I guess there’s a thing where one person reads their paper (word-for-word!) and then someone else is the designated criticizer? I forget the details. Extremely different from my experience in physics academia though!!
(Obviously, reading that memoir is probably not the most time-efficient way to learn about the day-to-day practice of academic philosophy.)
(Oh, there was another funny anecdote in the memoir where the American professional philosopher association basically had a consensus against some school of philosophy, and everyone was putting it behind them and moving on, but then there was a rebellion where the people who still liked that school of philosophy did a hostile takeover of the association’s leadership!)
A non-ethics example that jumps to my mind is David Chalmers on the Hard Problem of Consciousness here: “So if I’m giving my overall credences, I’m going to give, 10% to illusionism, 30% to panpsychism, 30% to dualism, and maybe the other 30% to, I don’t know what else could be true, but maybe there’s something else out there.” That’s the only example I can think of but I read very very little philosophy.
What are the issues that are “difficult” in philosophy, in your opinion? What makes them difficult?
I remember you and others talking about the need to “solve philosophy”, but I was never sure what was meant by that.
You can’t solve philosophy without solving epistemology. And you can’t solve epistemology because of the Problem of the Criterion, which is pretty much the same as the Münchhausen Trilemma.
“Moreover, its [philosophy’s] central tool is intuition, and this displays a near-total ignorance of how brains work. As Michael Vassar observes, philosophers are “spectacularly bad” at understanding that their intuitions are generated by cognitive algorithms.”—Rob Bensinger, Philosophy, a diseased discipline.
What’s the problem?
It’s not that philosophers weirdly and unreasonably prefer intuition to empirical facts and mathematical/logical reasoning, it is that those things either don’t go far enough, or are themselves based on intuition.
“Just use empiricism” doesn’t work, because philosophy is about interpreting empirical data.
“Just use maths/logic” doesn’t work, because those things are based on axioms justified by intuitive appeal.
“Just use reductionism” doesn’t work, because it’s not clear what lies at the bottom of the stack, or if anything does. Logic, epistemology and ontology have been held to be First Philosophy at different times. Logic, epistemology and ontology also seem to interact. Correct ontology depends on correct epistemology, but what minds are capable of knowing depends on ontology. Logic possibly depends on ontology too, since quantum mechanics arguably challenges traditional bivalent logic.
Philosophers don’t embrace intuitions because they think they are particularly reliable, but because they have reasoned that they can’t do without them. (At least, the other options allowed by the Münchhausen trilemma, circularity and regress, are distinctly unattractive.) That is the essence of the Inconvenient Ineradicability of Intuition. An unfounded foundation is what philosophers mean by “intuition”...and not a supernatural insight that could not have been produced by a material brain.
I blame science, math, engineering, entrepreneurship. Philosophy is the practice of the esoteric method, meaning it can’t be made truly legible for very long stretches of investigation. This results in accumulation of anti-epistemic hazards, which science doesn’t particularly need to have tools for dealing with, because it can filter its reasoning through frequent transitions into legibility.
Philosophy can’t rely on such filtering through legibility, it has to maintain sanity the hard way. But as philosophy enviously looks at the more successful endeavors of science, it doesn’t see respect for such methods of maintaining sanity in its reasoning, instead it sees that merely moving fast and breaking things works very well. And so the enthusiasm for their development wanes, instead philosophy remains content with the object level questions that investigate particular truths, rather than methods for getting better at telling which cognitive algorithms can more robustly arrive at truths (rationality, metaphilosophy).
To whom does this not apply? Most people who “work on AI alignment” don’t even think that thinking is a thing.
@Nate Showell @P. @Tetraspace @Joseph Miller @Lorxus
I genuinely don’t know what you want elaboration of. Reacts are nice for what they are, but saying something out loud about what you want to hear more about / what’s confusing / what you did and didn’t understand/agree with, is more helpful.
Re/ “to whom not...”, I’m asking Wei: what groups of people would not be described by the list of 6 “underestimating the difficulty of philosophy” things? It seems to me that broadly, EAs and “AI alignment” people tend to favor somewhat too concrete touchpoints like “well, suppressing revolts in the past has gone like such and such, so we should try to do similar for AGI”. And broadly they don’t credit an abstract argument about why something won’t work, or would only work given substantial further philosophical insight.
Re/ “don’t think thinking …”, well, if I say “LLMs basically don’t think”, they’re like “sure it does, I can keep prompting it and it says more things, and I can even put that in a scaffold” or “what concrete behavior can you point to that it can’t do”. Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is isn’t clear, but that thinking is (at all) should be presumed, pending a forceful philosophical conceptual replacement!
That is, in fact, a helpful elaboration! When you said
my leading hypotheses for what you could mean were:
Using thought, as a tool, has not occurred to most such people
Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.
Now, instead, my leading hypothesis is that you mean:
Most such people are failing to notice that there’s an important process, called “thinking”, which humans do but LLMs “basically” don’t do.
This is a bunch more precise! For one, it mentions AIs at all.
As my reacts hopefully implied, this is exactly the kind of clarification I needed—thanks!
Sure, but you’re not preaching to the choir at that point. So surely the next step in that particular dance is to stick a knife in the crack and twist?
That is -
And if neither you nor anyone else can do either of those things… maybe it’s time to give up and say that this ‘thinking’/‘reasoning’ thing is just philosophically confused? I don’t think that that’s where we’re headed, but I find it important to explicitly acknowledge the possibility; I don’t deal in more than one epiphenomenon at a time and I’m partial to Platonism already. So if this ‘reasoning’ thing isn’t meaningfully distinguishable in some observable way from what LLMs do, why shouldn’t I simply give in?
I’ve had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I’ve yet to get a positive answer (or find someone expressing this sentiment independently):
This aside, there are some people around LW/rationality who seem more cautious/modest/self-critical about proposing new philosophical solutions, like MIRI’s former Agent Foundations team, but perhaps partly as a result of that, they’re now out of a job!
Yeah that was not my reaction. (More like “that’s going to be the most beautiful thing ever” and “I want to be that too”.)
No, if anything the job loss resulted from not doing so much more, much more intently, and much sooner.
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done “so much more, much more intently, and much sooner”?
The type of fundamental problem that proper speculative philosophy is supposed to solve is the sort where streetlighting doesn’t work (or isn’t working, or isn’t working fast enough). But nearly all of the alignment field after like 2004 was still basically streetlighting. It was maybe a reasonable thing to have some hope in prospectively, but retrospectively it was too much investment in streetlighting, and retrospectively I can make arguments about why one should have maybe guessed that at the time. By 2018 IIRC, or certainly by 2019, I was vociferously arguing for that in AF team meetings—but the rest of the team either disagreed with me or didn’t understand me, and on my own I’m just not that good a thinker, and I didn’t find anyone else to try it with. I think they have good thoughts, but are nevertheless mostly streetlighting—i.e. not trying to take step after step of thinking at the level of speculative philosophy AND aimed at getting the understanding needed for alignment.
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn’t convince them (or someone else) to join you. To me this doesn’t clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)
If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to “see in the dark” to pick out a more promising search area or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?
(Also perfectly fine to stop here if you want. I’m pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)
If you say to someone
and then you listen to what they say, and reask the question and interrogate their answers, IME what it very often grounds out into is something like:
So I would summarize that position as endorsing streetlighting, in a very broad sense that encompasses most math / science / technology. And this position is largely correct! My claim is that
this is probably too slow for making Friendly AI, and
maybe one could go faster by trying to more directly cleave to the core philosophical problems.
I discuss the problem more here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
(But note that, while that essay frames things as “a proposed solution”, the solution is barely anything—more like a few guesses at pieces of methodology—and the main point is the discussion of the problem; maybe a writing mistake.)
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
OK but what would this even look like?
Toss away anything amenable to testing and direct empirical analysis; it’s all too concrete and model-dependent.
Toss away mathsy proofsy approaches; they’re all too formalized and over-rigid and can only prove things from starting assumptions we haven’t got yet and maybe won’t think of in time.
Toss away basically all settled philosophy, too; if there were answers to be had there rather than a few passages which ask correct questions, the Vienna Circle would have solved alignment for us.
What’s left? And what causes it to hang together? And what causes it not to vanish up its own ungrounded self-reference?
From scratch but not from scratch. https://www.lesswrong.com/posts/noxHoo3XKkzPG6s7E/most-smart-and-skilled-people-are-outside-of-the-ea?commentId=DNvmP9BAR3eNPWGBa
https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
What makes you think there are any such ‘answers’, renderable in a form that you could identify?
And even if they do exist, why do you think a human being could fully grasp the explanation in finite time?
Edit: It seems quite possible that even the simplest such ‘answers’ could require many years of full-time effort to understand, putting them beyond most if not all human memory capacity, i.e., by the end even those who ‘learned’ them will have forgotten many parts near the beginning.
(Upvoted since your questions seem reasonable and I’m not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI’s arguments. (See the sketch after this list for a minimal illustration of the proof-checking half of this analogy.)
We increase our own intelligence and philosophical competence until we can verify the ASI’s reasoning ourselves.
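For concreteness on the first option: the analogy is to something like the snippet below, where a small trusted kernel mechanically verifies every step of a proof, so we can accept the conclusion without trusting whoever (or whatever) produced it. The theorem is deliberately trivial; the hard, unsolved part is having anything analogous for philosophical arguments.

```lean
-- Deliberately trivial illustration of mechanical proof checking: the Lean
-- kernel verifies each step, so the conclusion can be trusted without
-- trusting the proof's author. Option 1 above hopes for an analogous
-- checker for philosophical reasoning, which does not currently exist.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```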
About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.
Having finally experienced the LW author moderation system firsthand by being banned from an author’s posts, I want to make two arguments against it that may have been overlooked: the heavy psychological cost inflicted on a commenter like me, and a structural reason why the site admins are likely to underweight this harm and its downstream consequences.
(Edit: To prevent a possible misunderstanding, this is not meant to be a complaint about Tsvi, but about the LW system. I understand that he was just doing what he thought the LW system expected him to do. I’m actually kind of grateful to Tsvi to let me understand viscerally what it feels like to be in this situation.)
First, the experience of being moderated by an opponent in a debate inflicts at least the following negative feelings:
Unfairness. The author is not a neutral arbiter; they are a participant in the conflict. Their decision to moderate is inherently tied to their desire to defend their argument and protect their ego and status. In a fundamentally symmetric disagreement, the system places you at a profound disadvantage for reasons having nothing to do with the immediate situation. To a first approximation, they are as likely as you to be biased, so why do they get to be the judge?
Confusion. Consider the commenters who are also authors and manage their own threads through engagement, patience, tolerance, and a healthy dose of self-doubt. They rarely feel a need or desire to go beyond argumentation and voting (edit: at least on a platform like LW with mods pre-filtering users for suitability), so when they are deleted or banned, it creates a sense of bewilderment as to what they could have possibly done to deserve it.
Alienation. The feeling of being powerless to change the system, because so few people are like you, even in a community of people closest to you on Earth in ways of thinking. That you’re on an alien planet, or a mistake theorist surrounded by conflict theorists, with disengagement and self-imposed exile as the only ways out.
Second, this cost and its consequences are perhaps systematically underestimated because the admins are structurally immune to it. An author would almost never ban an admin, meaning they never (or rarely, perhaps on other platforms) experience these feelings. They, being authors themselves, see the author’s pain firsthand, but the commenter’s feelings are merely an abstract report at most. This seems like a source of bias that becomes obvious once pointed out, but doesn’t appear to have been made explicit before, at least as far as I’ve seen.
I don’t want to relitigate the policy in a balanced way at this point, but simply to introduce some potentially new considerations. So, admins, no need to respond now, but please keep these points in mind if you do decide to rethink the policy at some point.
It feels like there’s a confusion of different informal social systems with how LW 2.0 has been set up. Forums have traditionally had moderators distinct from posters, and even when moderators also participate in discussions on small forums, there are often informal conventions that a moderator should not put on a modhat if they are already participating in a dispute as a poster, and a second moderator should look at the post instead (you need more than one moderator for this of course).
The LW 2.0 author moderation system is what blog hosting platforms like Blogger and Substack use, and the bid seems to have been to entice people who got big enough to run their standalone successful blog back to Lesswrong. On these platforms the site administrators are very hands-off and usually only drop in to squash something actually illegal (and good luck getting anyone to talk to if they actually decide your blog needs to be wiped from the system), and the separate blogs are kept very distinct from each other with little shared site identity, so random very weird Blogger blogs don’t really create that much of an overall “there’s something off with Blogger” vibe. They just exist on their own domain and mostly don’t interact with the rest of the platform.
Meanwhile, LW is still very much in the forum mold, the posts exist in the same big pool and site moderators are very hands-on, give warnings and can be talked to. Standalone blog author tier people mostly don’t seem to have come back to post a large volume of LW threads, and the dynamics are still very forum-like, so basically now there’s just the chaotic extra element that any random person who started a forum thread can act as moderator and moderate other users as well as their individual comments on their threads, and this adds weird drama and dysfunction to the forum social dynamic. Most of the time it happens it’ll also violate the informal rule that a moderator should not start moderating the dispute they themselves got initially involved in as a non-modhat poster.
EDIT: The third system mixed in is Facebook/Twitter style social media that’s a “steppe” instead of a “valley”, meaning that you have a steady stream of complete strangers coming in and out instead of a pool of a few dozen to a few hundred people who might have been around for over a decade. You want a very low friction ban mechanism on a steppe site since a lot of first interactions will be bad and usually indicate the drive-by stranger they’re from is not worth interacting with. On a valley site the person interacting with you is much more likely to be tightly invested in the very local area, so blocking them is bigger drama generator.
@Eliezer Yudkowsky
This seems a good opportunity to let you know about an ongoing debate over the LW moderation system. rsaarelm’s comment above provides a particularly sharp diagnosis of the problem that many LWers see: author moderation imposes a “personal blog” moderation system onto a site that functions as a community forum, creating confusion, conflict, and dysfunction because the social norms of the two models are fundamentally at odds.
Even the site’s own admins seem confused. Despite defending the “blog” moderation model at every turn, the recently redesigned front-page Feed gives users no indication that by replying to a comment or post, they would be stepping into different “private spaces” with different moderators and moderation policies. It is instead fully forum-like.
Given the current confusions, we may be at a crossroads where LW can either push fully into the “personal blog” model, or officially revert back to the “forum” model that is still apparent from elements of the site’s design, and has plenty of mind share among the LW user base.
I suspect that when you made the original request for author moderation powers, it was out of intuitive personal preference. The site admins initially agreed to your request to entice you back to posting more on LW, but over the years developed a range of justifications for the system (that honestly appear to me more like rationalizations to support the original decision).
This history may be causing much of the current difficulties, because the admins may (perhaps subconsciously) worry that if they fully reevaluated the decision, it could lead to a repudiation of the system, which would necessitate going back on a commitment made to you. Therefore a reassessment from you, based in part on what we have learned living with the current “hybrid” moderation system over the past 8 years, could be invaluable in prompting the admins to also reconsider the system without the historical baggage.
To be clear, I think the above paragraph has a <50% chance of happening, and I may well be totally off in my inference of how the current system and its justifications came into being, but it seems worth checking with you, just in case.
It’s indeed the case that I haven’t been attracted back to LW by the moderation options that I hoped might accomplish that. Even dealing with Twitter feels better than dealing with LW comments, where people are putting more effort into more complicated misinterpretations and getting more visibly upvoted in a way that feels worse. The last time I wanted to post something that felt like it belonged on LW, I would have only done that if it’d had Twitter’s options for turning off commenting entirely.
So yes, I suppose that people could go ahead and make this decision without me. I haven’t been using my moderation powers to delete the elaborate-misinterpretation comments because it does not feel like the system is set up to make that seem like a sympathetic decision to the audience, and does waste the effort of the people who perhaps imagine themselves to be dutiful commentators.
Curious whether you have any guesses on what would make it seem like a sympathetic decision to the audience. My model here is that this is largely not really a technical problem, but more of a social problem (which is e.g. better worked towards by things like me writing widely read posts on moderation), though I still like trying to solve social problems with better technical solutions and am curious whether you have ideas (that are not “turn off commenting entirely”, which I do think is a bad idea for LW in particular).
I’m not sure what Eliezer is referring to, but my guess is that many of the comments that he would mark as “elaborate-misinterpretations”, I would regard as reasonable questions / responses, and I would indeed frown on Eliezer just deleting them. (Though also shrug, since the rules are that authors can delete whatever comments they want.)
Some examples that come to mind are this discussion with Buck and this discussion with Matthew Barnett, in which (to my reading of things) Eliezer seems to be weirdly missing what the other person is saying at least as much as they are missing what he is saying.
From the frustration Eliezer expressed in those threads, I would guess that he would call these elaborate misinterpretations.
My take is that there’s some kind of weird fuckyness about communicating about some of these topics where both sides feel exasperation that the other side is apparently obstinately mishearing them. I would indeed think it would be worse if the post author in posts like that just deleted the offending comments.
I currently doubt the Buck thread would qualify as such from Eliezer’s perspective (and agree with you there that in as much as Eliezer disagrees, he is wrong in that case).
IMO I do think it’s a pretty bad mark on LW’s reputation that posts like Matthew’s keep getting upvoted, with what seem to me like quite aggressively obtuse adversarial interpretations of what people are saying.
The existence of the latter unfortunately makes the former much harder to navigate.
I’m guessing that there are enough people like me, who have such a strong prior on “a moderator shouldn’t mod their own threads, just like a judge shouldn’t judge cases involving themselves”, plus our own experiences showing that the alternative of forum-like moderation works well enough, that it’s impossible to overcome this via abstract argumentation. I think you’d need to present some kind of evidence that it really leads to better results than the best available alternative.
Nowhere on the whole wide internet works like that! Clearly the vast majority of people do not think that authors shouldn’t moderate their own threads. Practically nowhere on the internet do you even have the option for anything else.
Where’s this coming from all of a sudden? Forums work like this, Less Wrong used to work like this. Data Secrets Lox still works like this. Most subreddits work like this. This whole thread is about how maybe the places that work like this have the right idea, so it’s a bit late in the game to open up with “they don’t exist and aren’t a thing anyone wants”.
Yes, Reddit is one of the last places on the internet where this is semi-common, but even there, most subreddits are moderated by people who are active posters, and there are no strong norms against moderators moderating responses to their own comments or posts.
I agree I overstated here and that there are some places on the internet where this is common practice, but it’s really a very small fraction of the internet these days. You might bemoan this as a fate of the internet, but it’s just really not how most of the world thinks content moderation works.
There is actually a significant difference between “Nowhere on the whole wide internet works like that!” and “few places work like that”. It’s not just a nitpick, because to support my point that it will be hard for Eliezer to get social legitimacy for freely exercising author mod power, I just need there to be a not-too-tiny group of people on the Internet who still prefer to have no author moderation (it can be small in absolute numbers, as long as it’s not near zero, since they’re likely to congregate at a place like LW that values rationality and epistemics). The fact that there are still even a few places on the Internet that work like this makes a big difference to how plausible my claim is.
I mean, I think no, if truly there is only a relatively small fraction of people like that around, we as the moderators can just ask those people to leave. Like, it’s fine if we have to ask hundreds of people to leave, the world is wide and big. If most of the internet is on board with not having this specific stipulation, then there is a viable LessWrong that doesn’t have those people.
No, I don’t “need” to do that. This is (approximately) my forum. If anything you “need” to present some kind of evidence that bridges the gap here! If you don’t like it build your own forum that is similarly good or go to a place where someone has built a forum that does whatever you want here.
The point of the post is not to convince everyone, there was never any chance of that, it’s to build enough shared understanding that people understand the principles of the space and can choose to participate or leave.
Ok I misunderstood your intentions for writing such posts. Given my new understanding, will you eventually move to banning or censoring people for expressing disapproval of what they perceive as bad or unfair moderation, even in their own “spaces”? I think if you don’t, then not enough people will voluntarily leave or self-censor such expressions of disapproval to get the kind of social legitimacy that Eliezer and you desire, but if you do, I think you’ll trigger an even bigger legitimacy problem because there won’t be enough buy-in for such bans/censorship among the LW stakeholders.
This is a terrible idea given the economy of scale in such forums.
I mean, I had a whole section in the Said post about how I do think it’s a dick move to try to socially censure people for using any moderation tools. If someone keeps trying to create social punishment for people doing that, then yeah, I will ask them to please do that somewhere else but here, or more likely, leave the content up but reduce the degree to which things like the frontpage algorithm feed attention to it. I don’t know how else any norms on the site are supposed to bottom out.
Top-level posts like this one seem totally fine. Like, if someone wants to be like “I am not trying to force some kind of social punishment on anyone, but I do think there is a relevant consideration here, but I also understand this has been litigated a bunch and I am not planning to currently reopen that”, then that’s fine. Of course you did kind of reopen it, which to be clear I think is fine on the margin, but yeah, I would totally ask you to stop if you did that again and again.
I think an issue you’ll face is that few people will “try to socially censure people for using any moderation tools”, but instead different people will express disapproval of different instances of perceived bad moderation, which adds up to a large enough share of all author moderation getting disapproved of (or worse, blowing up into big dramas) that authors like Eliezer do not feel there’s enough social legitimacy to really use them.
(Like in this case I’m not following the whole site and trying to censure anyone who does author moderation, but speaking up because I myself got banned!)
And Eliezer’s comment hints why this would happen: the comments he wants to delete are often highly upvoted. If you delete such comments, and the mod isn’t a neutral third party, of course a lot of people will feel it was wrong/unfair and want to express disapproval, but they probably won’t be the same people each time.
How are you going to censor or deprioritize such expressions of disapproval? By manual mod intervention? AI automation? Instead of going to that trouble and causing a constant stream of resentment from people feeling wronged and silenced, it seems better for Eliezer to just mark the comments that misinterpret him as misinterpretations (maybe through the react system or a more prominent variation of it, if he doesn’t want to just reply to each one and say “this is a misinterpretation”). One idea is that reacts from the OP author are distinguished or more prominently displayed somehow.
No, my guess is this is roughly the issue. I think the vast majority of complaints here tend to be centered in a relatively small group of people who really care.
It’s not a particularly common expectation that people have about how the internet works, as I have said in other places in this thread. I don’t think the rest of the internet gets these kinds of things right, but I also don’t think that there will be an unquenchable torrent of continuous complaints that will create a landscape of perpetual punishment for anyone trying to use moderation tools.
I think if you resolve a few disagreements, and moderate a relatively small number of people, you end up at an equilibrium that seems a bunch saner to me.
The rest of the Internet is also not about rationality though. If Eliezer started deleting a lot of highly upvoted comments questioning/criticizing him (even if based on misinterpretations like Eliezer thinks), I bet there will be plenty of people making posts like “look at how biased Eliezer is being here, trying to hide criticism from others!” These posts themselves will get upvoted quite easily, so this will be a cheap/easy way to get karma/status, as well as (maybe subconsciously) getting back at Eliezer for the perceived injustice.
I don’t know if Eliezer is still following this thread or not, but I’m also curious why he thinks there isn’t enough social legitimacy to exercise his mod powers freely, and whether it’s due to a similar kind of expectation.
I mean, yes, these dynamics have caused many people, including myself, to want to leave LessWrong. It sucks. I wish people stopped. Not all moderation is censorship. The fact that it universally gets treated as such by a certain population of LW commenters is one of the worst aspects of this site (and one of the top reasons why in the absence of my own intervention into reviving the site, this site would likely no longer exist at all today).
I think we can fix it! I think it unfortunately takes a long time, and continuous management and moderation to slowly build trust that indeed you can moderate things without suddenly everyone going insane. Maybe there are also better technical solutions.
Claiming this is about “rationality” feels like mostly a weird rhetorical move. I don’t think it’s rational to pretend that unmoderated discussion spaces somehow outperform moderated ones. As has been pointed out many times, 4Chan is not the pinnacle of internet discussion. Indeed, I think largely across the internet, more moderation results in higher trust and higher quality discussions (not universally, you can definitely go on a censorious banning spree as a moderator and try to skew consensus in various crazy ways, but by and large, as a correlation).
This is indeed an observation so core to LessWrong that Well-Kept Gardens Die By Pacifism was, as far as I can tell, a post necessary for LessWrong to exist at all.
I’m not saying this, nor are the hypothetical people in my prediction saying this.
We are saying that there is an obvious conflict of interest when an author removes a highly upvoted piece of criticism. Humans being biased when presented with COIs is common sense, so connecting such author moderation with rationality is natural, not a weird rhetorical move.
The rest of your comment seems to be forgetting that I’m only complaining about authors having COI when it comes to moderation, not about all moderation in general. E.g. I have occasional complaints like about banning Said, but generally approve of the job site moderators are doing on LW. Or if you’re not forgetting this, then I’m not getting your point. E.g.
I have no idea how this relates to my actual complaint.
Look, we’ve had these conversations.
I am saying the people who are moderating the spaces have the obvious information advantage about their own preferences and about what it’s actually like to engage with an interlocutor, plus the motivation advantage to actually deal with it. “It’s common sense that the best decisions get made by people with skin in the game and who are most involved with the actual consequences of the relevant decision”. And “it’s common sense that CEOs of organizations make hiring and firing decisions for the people they work with, boards don’t make good firing decisions, the same applies to forums and moderators”.
This is a discussion as old as time in business and governance and whatever. Framing your position as “common sense” is indeed just a rhetorical move, and I have no problem framing the opposite position in just as much of an “obvious” fashion. Turns out, neither position obviously dominates by common sense! Smart people exist on both sides of this debate. I am not against having it again, and I have my own takes on it, but please don’t try to frame this as some kind of foregone conclusion in which you have the high ground.
I was (and largely am) modeling you as being generically opposed to basically any non-spam bans or deletions on the site. Indeed, as I think we’ve discussed, the kind of positions that you express in this thread suggest to me that you should be more opposed to site-wide bans than author bans (since site-wide bans truly make countervailing perspectives harder to find, instead of driving them from the comment sections to top-level posts).
If you aren’t against site-wide bans, I do think that’s a pretty different situation! I certainly didn’t feel like I was empowered to moderate more in our conversations on moderation over the last year. It seemed to me you wanted both less individual author moderation, and less admin moderation for anything that isn’t spam. Indeed, I am pretty sure, though I can’t find it, that you said that LW moderation really should only establish a very basic level of protection against spam and basic norms of discourse, but shouldn’t do much beyond that, but I might be misremembering.
If you do support moderation, I would be curious about you DMing me some example of users you think we should ban, or non-spam comments we should delete. My current model of you doesn’t really think those exist.
I think you’re right that I shouldn’t have latched onto the first analogy I thought of. Here’s a list of 11 (for transparency, analogies 3-10 were generated by Gemini 3.0 Pro, though some may have appeared in previous discussions):
The CEO & The Corporation
The Judge & The Courtroom
The Dinner Party Host
The University Classroom / Professor
The Conference Breakout Session
Open Source / GitHub Maintainer
The Stand-Up Comedian & The Heckler
The Art Gallery Opening
Graffiti on a Private House
The Town Hall vs Private Meetings
The Hypothetical HOA
I decided to put detailed analysis of these analogies in this collapsed section, as despite extensive changes by me from the original AI-generated text, it doesn’t quite read like my style. Also, it might be too much text and my summary/conclusions below may be sufficient to convey the main points.
1. The CEO & The Corporation
Analogy: A Forum Post is a “Project.” The Author is the CEO; the Commenter is an Employee. The CEO needs the power to fire employees who disrupt the vision, and the Board (Admins) should defer to the CEO’s judgment.
Disanalogy: In a corporation, the Board cannot see daily operations, creating information asymmetry; on a forum, Admins see the exact same content as the Author. A CEO has a smaller conflict of interest when firing an employee, because they are judged primarily by the company’s financial performance rather than the perception of their ideas. If they fire an employee who makes a good criticism, they might subsequently look better to others, but the company’s performance will suffer.
Conclusion: The analogy fails because the Author lacks the financial alignment of a CEO and possesses no special private information that the Admins lack.
2. The Judge & The Courtroom
Analogy: When there is a conflict in the physical world, we find disinterested parties to make enforceable judgments, even if the cost is very high. When the cost is too high, we either bear it (wait forever for a trial date) or give up the possibility of justice or enforcement, rather than allow an interested party to make such judgments.
Disanalogy: A courtroom has the power of Coercion (forcing the loser to pay, go to jail, or stop doing something). A Forum Author only has the power of Dissociation (refusing to host the commenter’s words). We require neutral judges to deprive people of rights/property; we do not require neutral judges to decide who we associate with.
Conclusion: Dissociation has its own externalities (e.g., hiding of potentially valuable criticism), which we usually regulate via social pressure, or legitimize via social approval, but you don’t want this and therefore need another source of legitimacy.
3. The Dinner Party Host
Analogy: A Post is a private social gathering. The Author is the Host. The Host can kick out a guest for any reason, such as to curate the conversation to his taste.
Disanalogy: In the real world, if a Host kicks out a guest that everyone else likes, the other attendees would disapprove and often express such disapproval. There is no mechanism to then suppress such disapproval, like the one you seek.
Conclusion: You want the power of the Host without the social accountability that naturally regulates a Host’s behavior.
4. The University Classroom / Professor
Analogy: The Author is a Subject Matter Expert (Professor). The Commenter is a Student. The Dean (Admin) lets the Professor silence students to prevent wasting class time.
Disanalogy: A classroom has a “scarce microphone” (only one person can speak at a time); a forum has threaded comments (parallel discussions), so the “Student” isn’t stopping the “Professor” from teaching. Additionally, LessWrong participants are often peers, not Student/Teacher.
Conclusion: The justification for silencing students (scarcity of time/attention, asymmetry of expertise) does not apply to LW.
5. The Conference Breakout Session
Analogy: The Author is like an Organizer who “rented the room” at a convention. The Organizer has the right to eject anyone to accomplish his goals.
Disanalogy: Just like the Dinner Party, an Organizer would almost never eject someone who is popular with their table. If they did, the table would likely revolt.
Conclusion: This analogy fails to justify the action of overriding the local consensus (upvotes) of the participants in that sub-thread.
6. Open Source / GitHub Maintainer
Analogy: A Post is a Code Repository. A Comment is a Pull Request. The Maintainer has the absolute right to close a Pull Request as “Wontfix” or “Off Topic” to keep the project focused.
Disanalogy: In Open Source, a rejected Pull Request is Closed, not Deleted. The history remains visible, easy to find, and auditable. Also, this situation is similar to the CEO in that the maintainer is primarily judged on how well their project works, with the “battle of ideas” aspect a secondary consideration.
Conclusion: You are asking for more power for an Author than a Maintainer, and a Maintainer has less COI for reasons similar to a CEO.
7. The Stand-Up Comedian & The Heckler
Analogy: The Author is a Comedian. The Commenter is a Heckler. Even if the Heckler is funny (Upvoted), they are stealing the show. The Club (Admins) protects the Comedian because writing a set is high-effort.
Disanalogy: In a physical club, the Heckler interrupts the show. In a text forum, the comment sits below the post. The audience can consume the Author’s “set” without interference before reading the comment.
Conclusion: The physical constraints that justify silencing a heckler do not exist in a digital text format.
8. The Art Gallery Opening
Analogy: The Post is a Painting. The Upvoted Comment is a Critic framing the art negatively. The Artist removes the Critic to preserve the intended Context of the work.
Disanalogy: Art is about aesthetics and subjective experience. LessWrong is ostensibly about intellectual progress and truth-seeking.
Conclusion: Prioritizing “Context” over “Criticism” serves goals that are not LW’s.
9. Graffiti on a Private House
Analogy: A Post is the Author’s House. A Comment is graffiti. The homeowner has the right to scrub the wall (Delete) so neighbors don’t see it.
Disanalogy: This is purely about property value and aesthetics.
Conclusion: Again the goals are too different for the analogy to work.
10. The Town Hall vs Private Meetings
Analogy: In the real world we have both town halls (Neutral Moderator) and meetings in private houses (Author Control). We can have both.
Disanalogy: Even in the discussions inside a private house, social norms usually prevent a host from kicking out a guest who is making popular points that everyone else agrees with.
Conclusion: The social legitimacy that you seek doesn’t exist here either.
11. The Hypothetical HOA
Analogy: A hypothetical residential community with HOA rules that say a homeowner not only has the right to kick out any guests during meetings/parties, but that no one is allowed to express disapproval of exercising such powers. Anyone who buys a house in the community is required to sign the HOA agreement.
Disanalogy: There are already many people in the LW community who never “signed” such agreements.
Conclusion: You are proposing to ask many (“hundreds”) of the existing “homeowners” (some of whom have invested years of FTE work into site participation) to leave, which is implausible in this hypothetical analogy.
Overall Conclusions
None of the analogies are perfect, but we can see some patterns when considering them together.
Neutral, disinterested judgement is a standard social technology for gaining legitimacy. In the case of courts, it is used to legitimize coercion, an otherwise illegitimate activity that would trigger much opposition. In the case of a forum, it can be used to legitimize (or partly legitimize) removing/hiding/deprioritizing popular/upvoted critiques.
Some analogies provide a potential new idea for gaining such legitimacy in some cases: relatively strong and short external feedback loops like financial performance (for the CEO) and real-world functionality (for the open source maintainer) can legitimize greater unilateral discretion. This can potentially work on certain types of posts, but most lack such short-term feedback.
In other cases, suppression of dissent is legitimized for specific reasons clearly not applicable to LW, such as clear asymmetry of expertise between speaker and audience, or physical constraints.
In the remaining cases, the equivalent of author moderation (e.g., kicking out a houseguest) is legitimized only by social approval, but this is exactly what you and Eliezer want to avoid.
Having gone through all of these possible analogies, I think my intuition for judges/courts being the closest analogy to moderation is correct after all: in both cases, disinterested judgement seems to be the best or only way to gain social legitimacy for unpopular decisions.
However, this exercise also made me realize that in most of the real world we do allow people to unilaterally exercise the power of dissociation, as long as it’s regulated by social approval or disapproval, and this may be a reasonable prior for LW.
Perhaps the strongest argument (for my most preferred policy of no author moderation, period) at this point is that unlike the real world, we lack clear boundaries to signal when we are entering a “private space”, and it is not clear how much power/responsibility the authors are supposed to have, with the site mods also being around. The result is a high cost of background confusion (having to track different people’s moderation policies/styles, or failing to do so and being surprised) as well as a high probability of drama/distraction whenever it is used, because people disagree or are confused about the relevant norms.
On the potential benefits side, the biggest public benefits of moderation can only appear when it’s against the social consensus, otherwise karma voting would suffice as a kind of moderation. But in this case clearly social approval can’t be a source of legitimacy, and if disinterested judgment and external feedback are also unavailable as sources of legitimacy, then it’s hard to see what can work. (Perhaps worth reemphasizing here, I think this intuitive withholding of legitimacy is correct, due to the high chance of abuse when none of these mechanisms are available.) This leaves the private psychological benefit to the author, which is something I can’t directly discuss (due to not having a psychology that wants to “hard” moderate others), and can only counter with the kind of psychological cost to author-commenters like myself, as described in the OP.
@Ben Pace I’m surprised that you’re surprised. Where did your impression that I generally disapprove of the job site moderators are doing on LW come from, if you can recall?
In the last year I’d guess you’ve written over ten thousand words complaining about LW moderation over dozens of comments, and I don’t recall you ever saying anything positive about the moderation? I recall you once said that you won’t leave the site over our actions (so far), which sounds like you’ll bear our moderation, but that is quite different from saying it’s overall good.
Thanks, to clarify some more in case it’s helpful, I think I’ve only complained about 2 things, the Said banning and the author moderation policy, and the word count was just from a lot of back and forth, not the number of issues I’ve had with the mod team? A lot of what you do is just invisible to me, like the user pre-filtering that habryka mentioned and the routine moderation work, but I assume you’re doing a good job on them, as I’m pretty happy with the general LW environment as far as lack of spam, generally good user behavior, and not seeing many complaints about being unfairly moderated by the mod team, etc.
Found my quote about not leaving:
Yeah I think you misinterpreted it. I was just trying to say that unlike those who got what they wanted (the author mod policy) by leaving or threatening to leave, I’m explicitly not using this threat as a way to get what I want. It was a way to claim the moral high ground I guess. Too bad the message misfired.
rsaarelm gave an excellent explanation early on about how the issue seems to be an incompatibility between forum mechanics and blog mechanics, rather than an issue with moderation itself. It would be unfortunate if the point was overlooked because it was misunderstood as “moderation is bad”.
It is fair to say that a blog with a policy “I’ll moderate however I like, if you don’t like it leave” works fine. It’s the default and implicit.
When it comes to a forum system with as many potential posters as there are commenters, “If you don’t like it leave” is the implicit ultimatum from every single user to every other. But if the feed system that governs content exposure doesn’t allow leaving individual posters, then the only thing that could be left is the entire forum.
This is why all other significant sites with a many producers → many consumers model have unsubscribe, mute and/or block features. It helps ensure a few weeds in the Well-Kept Garden don’t drive away all the plants with low toxin tolerance.
It sounds like—particularly from testimony from habryka and Eliezer—moving to a more meta-blog-like system is/was critical to lesswrong being viable. Which means leaning into that structure and fully implementing the requisite features seems like an easy way to improve the experience of everyone.
I am perhaps misreading, but think this sentence should be interpreted as “if you want to convince [the kind of people that I’m talking about], then you should do [X, Y, Z].” Not “I unconditionally demand that you do [X, Y, Z].”
This comment seems like a too-rude response to someone who (it seems to me) is politely expressing and discussing potential problems. The rudeness seems accentuated by the object level topic.
Off-the-cuff idea, probably a bad one:
Stopping short of “turning off commenting entirely”, being able to make comments to a given post subject to a separate stage of filtering/white-listing. The white-listing criteria are set by the author and made public. Ideally, the system is also not controlled by the author directly, but by someone the author expects to be competent at adhering to those criteria (perhaps an LLM, if they’re competent enough at this point).
The system takes direct power out of the author’s hands. They still control the system’s parameters, but there’s a degree of separation now. The author is not engaging in “direct” acts of “tyranny”.
It’s made clear to readers that the comments under a given post have been subject to additional selection, whose level of bias they can estimate by reading the white-listing criteria.
The white-listing criteria are public. Depending on what they are, they can be (a) clearly sympathetic, (b) principled-sounding enough to decrease the impression of ad-hoc acts of tyranny even further.
(Also, ideally, the system doing the selection doesn’t care about what the author wants beyond what they specified in the criteria, and is thus an only boundedly and transparently biased arbiter.)
The commenters are clearly made aware that there’s no guarantee their comments on this post will be accepted, so if they decide to spend time writing them, they know what they’re getting into (vs. bitterness-inducing sequence where someone spends time on a high-effort comment that then gets deleted).
There’s no perceived obligation to respond to comments the author doesn’t want to respond to, because they’re rejected (and ideally the author isn’t even given the chance to read them).
There are no “deleting a highly-upvoted comment” events with terrible optics.
Probably this is still too censorship-y, though? (And obviously doesn’t solve the problem where people make top-level takedown posts in which all the blacklisted criticism is put and then highly upvoted. Though maybe that’s not going to be as bad and widespread as one might fear.)
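Concretely, the white-listing stage I have in mind might look roughly like the sketch below. Everything in it is hypothetical: llmJudge stands in for whichever model (or delegated human) actually applies the author’s public criteria, and publishComment / returnToCommenter are stand-ins for whatever the site would actually do with the verdict.

```typescript
// Hypothetical sketch of the proposed white-listing stage; none of this is an existing
// LessWrong API. llmJudge stands in for whichever model (or delegated human) applies
// the author's public criteria; publishComment / returnToCommenter are placeholders.
type Verdict = { accept: boolean; reason: string };

interface ScreenedPost {
  postId: string;
  publicCriteria: string; // written by the author, visible to readers and commenters
}

interface ScreenedComment {
  commentId: string;
  body: string;
  verdict: Verdict; // stored so the degree of selection can be audited later
}

async function submitComment(
  post: ScreenedPost,
  commentId: string,
  body: string,
  llmJudge: (criteria: string, body: string) => Promise<Verdict>
): Promise<ScreenedComment> {
  // The author is not consulted here; only the published criteria get applied.
  const verdict = await llmJudge(post.publicCriteria, body);
  const screened: ScreenedComment = { commentId, body, verdict };
  if (verdict.accept) {
    // Accepted comments appear under the post, flagged as having passed screening.
    publishComment(post.postId, screened);
  } else {
    // Rejected comments go back to the commenter with the stated reason, rather than
    // being silently deleted after the fact.
    returnToCommenter(screened);
  }
  return screened;
}

function publishComment(postId: string, c: ScreenedComment): void {
  console.log(`[${postId}] accepted comment ${c.commentId}`);
}

function returnToCommenter(c: ScreenedComment): void {
  console.log(`returned comment ${c.commentId}: ${c.verdict.reason}`);
}
```

The point of the shape is just that the author’s influence is confined to writing publicCriteria up front, and every verdict is recorded, so readers can later estimate how selective the screening actually was.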
Look, I know I wouldn’t want to use LessWrong if I couldn’t ban annoying commenters from my posts. I wouldn’t run LessWrong if I didn’t have the ability to ban annoying commenters from the site.
Yes, if there were trusted enough moderators who would do the job well-enough for me, then I could get away without moderation tools when posting, but currently, if I was an external party, I would not trust even the current LW team with a copy of myself to do this. The LW team is trying to allow a bunch of people with incompatible conversation styles to be on the site, and this means that I cannot expect site-wide bans and deletion to be sufficient for making it so I don’t have run-ins with a bunch of people who will very predictably waste hours of my time.
Like, de-facto we try to compensate a bunch for this by asking people what kind of moderation they would like on their post, and then try to help them with moderation, but it’s a labor intensive process that mostly fails for bandwidth reasons.
And look, it sucks. You show up with a top-level thread where you say:
And then you just fully ignore this intention, and tag a bunch of people to please weigh in on a dispute, absolutely asking me to respond. Like, sure, you can open up this thread again, and every time anyone takes any moderation decision ever on the site, but please at least don’t say that you are not going to do it, and then totally do it.
I’m totally fine with site moderators moderating in a disinterested way. It’s the “moderators modding their own thread” thing that I’m objecting to. Kind of strange that you’re still writing as if you don’t get this.
Yeah, I realize I’m being inconsistent. That was my original intention, but @rsaarelm subsequently came in with a really clear and concise description of the core problem, and it seemed like too good of an opportunity not to pivot a bit. (I don’t think I tagged anyone besides Eliezer though?) Of course I don’t blame you for wanting to respond in this situation.
I mean, that’s what my previous sentence was about. The logical structure was implicitly “Look, I know I wouldn’t want to use LessWrong if I couldn’t ban annoying commenters from my posts because I also wouldn’t run LessWrong if I didn’t have the ability to ban annoying commenters from the site.”
And then I continued explaining that trusting site-moderators to handle every case for me is a tricky task that I don’t expect site moderators are capable of. Indeed, it’s not even clear a copy of me could handle it, because global governance is indeed just different from local governance, and it’s very hard to wear both hats at the same time.
Cool, if we are on the same page that there was a shift here, I feel better about this. I agree there was an opportunity there (though I disagree that rsaarelm’s post was a particularly good summary of the situation, but it seems fine for you to believe that).
I think it kind of sucks that this did indeed invalidate my ability to trust that threads like this aren’t threads I have to follow in great detail and manage in costly ways, which then makes discussions about this kind of stuff worse in the future.
The intended UI here is the same as what happens on current blogposts, where these guidelines show up at the bottom of the commenting box. It’s a bug that it doesn’t show up, and my guess is we’ll fix it this week.
Like, I think this is a valuable piece of information, but definitely not one I would consider crucial. If you go to Twitter[1], or Facebook or any part of the rest of the internet, there is a universal assumption that the author of the post you are commenting on has moderation privileges on that post. People understand that a feed is a conglomerate of content from lots of different places. “No indication” is just false, it’s about the same level of indication as the whole rest of the internet has.
In the case of Twitter, an author can hide any replies made by anyone else under their tweets, which is not quite like deleting, but close enough for the purpose of this discussion
“Fundamentally at odds” seems way too strong to me, so I assume that I’m missing something.
In practice, Authors utilize their moderation privileges so rarely that there seems to be no difference whatsoever between the user experience in the “full-forum model” vs. the “private spaces” model? Like the difference has never bothered me, or impacted my behavior at all?
And indeed, I think that authors rarely use their moderation privileges because the private spaces on LessWrong are built on top of a platform that runs on the forum model. Authors don’t need to aggressively moderate their posts, because the LessWrong mod team does a lot of the work that would otherwise need to be done by the authors. In practice, we’re mostly relying on the forum model, but with an extra, rarely invoked, layer of the “private spaces” model for handling some exceptional cases. Which overall seems to work just fine to me?
Also, the author’s moderation policies are displayed just below the text box every time you post a comment? That seems importantly different than “no indication”, unless I’m misunderstanding what you mean.
Overall, I don’t get why this feels like such a big deal to you, yet.
What I mean is that on forums there is an expectation that mods will be fair to all sides, and this is in part achieved by the rule of not modding their own threads. If you feel like a mod abused their power (e.g. deleted content that didn’t violate the rules) you can often appeal to the other mods.
On a personal blog such expectations do not exist. If you get modded however unfairly, you just suck it up and move on. Since these expectations are totally opposite, when you mix the two models together on LW it becomes very confusing what one should expect.
It could be that “fundamentally at odds” is worded too strongly though. Let me know what you think given the above.
Right, but the site admins are trying to encourage people to use it more, so I thought I’d report my own experience of being author-banned, as a warning to them. I also think if they’re not used more, then the author mod powers should just be removed, to fix the above mentioned confusion, which is there regardless of how much the powers are actually used.
I’m talking specifically about the comment boxes in the Feed section of the front page, which do not have such policies displayed, as of this writing.
FWIW, this is just a bug (as I think I mentioned somewhere else in the thread).
I think it was also a desire to get people who liked a steppe style system to post. In particular, I recall Eliezer saying that he wanted a system similar to his Facebook page, where he can just ban an annoying commenter with a couple of clicks and be done with it.
Thank you, this seems like a very clear and insightful description of what is confusing and dysfunctional about the current situation.
To add some of my personal thoughts on this, the fact that the Internet always had traditional forums with the forum model of moderation shows that model can work perfectly well, and there is no need for LW to also have author moderation, from a pure moderation (as opposed to attracting authors) perspective. And “standalone blog author tier people” not having come back in 8 years since author mod was implemented means it’s time to give up on that hope.
LW is supposed to be a place for rationality, and the forum model of moderation is clearly better for that (by not allowing authors to quash/discourage disagreement or criticism). “A moderator shouldn’t mod their own threads” is such an obviously good rule and widely implemented on forums, that sigh… I guess I’ll stop here before I start imputing impure motives to the site admins again, or restart a debate I don’t really want to have at this point.
Thanks, that was a clear way to describe both perspectives here. Very helpful.
I stopped by lesswrong for the first time in a decade and (due to the familiar author) this was the first post that caught my attention in my feed. I’m shocked.
The new policy:
* Allows post authors to suppress any disagreement or correction at will.
* STILL does not allow blocking users
* Force-feeds these private, immune-from-criticism posts to all participants in a collective feed, with no way to opt out, mute, or block any egregious abusers.
This is a feature combination that isn’t seen on any successful site. For good reason. As others have mentioned sites where authors control participation always rely on readers being able to opt in, then opt out if the author deviates from good faith contribution (in the reader’s view). Forums where you cannot opt out do not allow conflicts of interest in moderation. Anyone moderating on their own disagreements is (and should be) viewed with suspicion.
Honestly, even the refusal to allow users to block and mute each other has always been a toxic trait of lesswrong. But being force fed posts and comments from bad faith debaters with no ability to refute, block or mute would make for a site even worse than x.com, at least in terms of technology for facilitating healthy conversation.
It should have been immediately obvious that as soon as posts become author controlled spaces, readers must be able to choose which authors they follow. How was this able to happen? Were there no adults in the room?
hi, just registering that Tsvi warned me before using mod tools recently, I updated, I was not banned, and it seemed fine. I generally think your reaction to this is reasonable and understandable, I was stressed by getting warned, but I tried to react in a way that would result in not getting banned. I think that the effect on how I write is not going to be that I censor myself, just that I think more before commenting on some people’s posts.
I do think there’s such a thing as giving a moderator too much power over a situation. but I also want to register that the noise any time this happens is frustrating and leads me to want features related to muting drama-related topics.
that said, I’m interested in what you think of bluesky’s model, where blocking someone hides their replies from your threads, but nobody can delete each other’s replies, so the replies are still out there and can and do turn into separate discussions when the people involved are still interested in them. in my opinion it’s a bit too strong, because it means gentle disagreement across factions is still hard to make happen, but it produces a strong incentive for disagreement across factions to be cordial, which I think has mildly increased the sanity of the people on bluesky (but… phew, there’s a long way to go, and the bluesky population’s view on AI seems pretty insane in a way that seems immune to evidence). still, I like it in many ways; it’s weaker than the LW/facebook model where you can just ban anyone.
I think the answer to this is, “because the post, specifically, is the author’s private space”. So they get to decide how to conduct discussion there (for reference, I always set moderation to Easy Going on mine, but I can see a point even to Reign of Terror if the topic is spicy enough). The free space for responses and rebuttals isn’t supposed to be the comments of the post, but the ability to write a different post in reply.
I do agree that in general if it comes to that—authors banning each other from comments and answering just via new posts—then maybe things have already gotten a bit too far into “internet drama” land and everyone could use some cooling down. And it’s generally probably easier to keep discussions on a post in the comments of the post. But I don’t think the principle is inherently unfair; you have the same exact rights as the other person and can always respond symmetrically, that’s fairness.
I want to just note, for the sake of the hypothesis space, a probably-useless idea: There could somehow be more affordance for a middle ground of “offshoot” posting. In other words, structurally formalize / enable the pattern that Anna exhibited in her comment here:
https://www.lesswrong.com/posts/AZwgfgmW8QvnbEisc/cfar-update-and-new-cfar-workshops?commentId=N2r5xTerxfxtfeLCJ
on her post, where she asked for a topic to be budded off to another venue. Adele then did so here:
https://www.lesswrong.com/posts/n299hFwqBxqwJfZyN/adele-lopez-s-shortform?commentId=k326Yx3vYBzQntS4j
And the ensuing discussion seemed productive. This is kinda like quote-tweeting as opposed to replying. The difference from just making your own shortform post would be that it’s a shortform post that is also paired with a comment on the original post. This would be useful if, as in the above example, the OP author asked for a topic to be discussed in a different venue; or if a commenter wants to discuss something, and also notify the author, and also make their comment visible to other people reading the comments on the OP, but wants to have their own venue or wants to avoid taking up attention in the OP because it’s off-topic or for whatever other reason.
That’s not a bad idea. You could link something like “this post is a reply to X” and then people could explore “threads” of posts that are all rebuttals and arguments surrounding a single specific topic. Doesn’t even need to be about things that have gotten this hostile, sometimes you just want to write a full post because it’s more organic than a comment.
Right. I realized later that I framed this as something the commenter decides; it would also be possible to have this sort of thing replace authors deleting comments or banning users. The author could press the “boot” button, and then this boots the comment out of the comment section. But it doesn’t delete it, it just moves all discussion to wherever the comment was booted to (e.g. open thread or quick take or something). Maybe it also hides most of the comment, and shows a single response from the author. (Not especially advocating for this.)
I think that’s the official explanation, but even the site admins don’t take it seriously. Because if this is supposed to be true, then why am I allowed to write and post replies directly from the front page Feed, where all the posts and comments from different authors are mixed together, and authors’ moderation policies are not shown anywhere? Can you, looking at that UI, infer that those posts and comments actually belong to different “private spaces” with different moderators and moderation policies?
This is indeed a notable design flaw. In principle, I think it allows one to get a highly upvoted top comment on a post which entirely misunderstands or doesn’t have anything to do with the top-level post.
Can I, looking at that UI, see how to get others’ “private spaces” out of my brain? The core mechanic of reading the site appears to be non-consensual exposure to others’ private (sometimes) nonsense.
@habryka I talked about why I almost never delete or ban here. I guess that comment talks about a “commitment” to not do this, but I’ve internalized the reason enough that I also just don’t feel a need or desire for it.
I understand that you don’t! But almost everyone else who I do think has those attributes does not have those criteria. Like, Scott Alexander routinely bans people from ACX, even Said bans people from datasecretslox. I am also confident that the only reason why you would not ban people here on LW, is because the moderators are toiling for like 2 hours a day to filter out the people obviously ill-suited for LessWrong.
Sidechannel note: Said wishes it to be known that he neither bans people from DSL nor customarily has the right to, the task being delegated to moderators rather than the sysop. ( https://share.obormot.net/textfiles/MINHjLX7 )
Sure! I was including “setting up a system that bans other people” in my definition here. I am not that familiar with how DSL works, but given that it bans people, and it was set up by Said, felt confident that thereby somehow Said chose to build a system that does ban people.
Though if Said opposes DSL banning people (and he thinks the moderators are making a mistake when doing so) then I would want to be corrected!
I think he just objected to the phrasing. I do think “set up a system where people can be banned by others whom Said does not instruct on who to ban” is a stretch for “Said bans people from DSL.”
I have generally found Said to mean the things he says quite literally and to expect others to do so as well. It’s painful to read a conversation where one person keeps assigning subtext to another who quite clearly never intended to put it there.
Another reason for not wanting to moderate is that I’ve never regretted not moderating on LW, nor can I recall any instances where some discussion of my post might have gone much better if I had. For example, sometimes a discussion goes off in a direction that I’m not interested in, but I just ignore it and let it develop however the participants want. Or if someone is being dumb or does something wrong, I just downvote it and ignore it (maybe give an explanation if I feel like it). I can’t recall anything particularly bad happening as a result of these decisions.
If the benefits of author moderation are as low as they seem to me from my own experience, I just can’t imagine it being worth the costs. Am I just very lucky, or what?
Did Scott or Said ban people on LW when they were here? If not, then I would amend that part to say that on a platform like LW with moderators doing a reasonable job pre-filtering people, people like me don’t feel a need or desire to ban. Which doesn’t seem to change my point much. If yes, then it would appear that I overgeneralized from my own example.
I don’t know about Scott. Him being personally active on the site was long before my tenure as admin, and I am not even fully sure how moderation or deletion at the time worked.
I don’t think Said ever banned anyone, though he also wrote only a very small number of top-level posts, so there wasn’t much opportunity. My guess is he wouldn’t have even if he had been writing a lot of top-level posts.
More substantively, I think my feelings and policies are fundamentally based on a (near) symmetry between the author and commenter. If they are both basically LW users in good standing, why should the author get so much more power in a conflict/disagreement?[1] So this doesn’t apply to moderating/filtering out users who are just unsuitable for LW or one’s own site.
I mean I understand you have your reasons, but it doesn’t remove the unfairness. Like if in a lawsuit for some reason a disinterested judge can’t be found, and the only option is to let a friend of the plaintiff be the judge, that “reason” is not going to remove the unfairness.
Ok thanks, I put in an edit to note your point.
I do think there is a bunch of truth to this, but I am active on many other forums, and have e.g. been issued moderation warnings on the EA Forum, so I do experience moderation in other contexts (and of course get blocked on Twitter from time to time). Also, I… think authors are not that much less likely to ban moderators from their posts than other users. Of the maybe 30 users who have ever been banned from other users’ posts, one of them is a moderator:
If Eliezer were more active on the site, my guess is that people would be a bunch more likely to ban him from their posts than they would other people for the same behavior. In general, in my experience, tallest-poppy dynamics are stronger in the rationality community than leadership-deference dynamics.
I’m thinking less deference to leaders, more not wanting to piss off people with power over an area that you care about.
What’s your preferred policy alternative? How could it be abused? What are your contingencies for when it inevitably does get abused? (Like, from what I recall about LW1.0’s history, lack of content moderation and protection from spammers was something that supposedly almost killed the site, or something.)
I’ve talked about that elsewhere and don’t want to get into it again. Probably most of it was here but I’m not totally sure. Edit: oh actually you’re talking about a different issue, which does not actually seem to be a problem that the author moderation system is supposed to address, as I’ve almost never seen spammers on my own posts or anyone else’s.
What do people think about having more AI features on LW? (Any existing plans for this?) For example:
AI summary of a poster’s profile, that answers “what should I know about this person before I reply to them”, including things like their background, positions on major LW-relevant issues, distinctive ideas, etc., extracted from their post/comment history and/or bio links.
“Explain this passage/comment” based on context and related posts, similar to X’s “explain this tweet” feature, which I’ve often found useful.
“Critique this draft post/comment.” Am I making any obvious mistakes or clearly misunderstanding something? (I’ve been doing a lot of this manually, using AI chatbots.)
“What might X think about this?”
Have a way to quickly copy all of someone’s posts/comments into the clipboard, or download as a file (to paste into an external AI).
I’ve been thinking about doing some of this myself (e.g., update my old script for loading all of someone’s post/comment history into one page), but of course would like to see official implementations, if that seems like a good idea.
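For #5, here’s a rough sketch of the kind of script I mean. The GraphQL endpoint exists, but the query shape, view name, and field names below are guesses on my part and would need to be checked against the site’s actual schema before use.

```typescript
// Rough sketch only: page through one user's comments and concatenate them, oldest
// first, for pasting into an external AI. The GraphQL query/view/field names are
// guesses at the forum's schema and need to be verified before relying on this.
async function fetchAllComments(userId: string): Promise<string> {
  const pageSize = 200;
  const collected: { postedAt: string; body: string }[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const res = await fetch("https://www.lesswrong.com/graphql", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        // Assumed query shape; check the live schema (e.g. via a GraphQL explorer).
        query: `query ($terms: JSON) {
          comments(input: { terms: $terms }) {
            results { postedAt contents { markdown } }
          }
        }`,
        variables: { terms: { view: "profileComments", userId, limit: pageSize, offset } },
      }),
    });
    const page: { postedAt: string; contents?: { markdown?: string } }[] =
      (await res.json()).data?.comments?.results ?? [];
    for (const c of page) {
      collected.push({ postedAt: c.postedAt, body: c.contents?.markdown ?? "" });
    }
    if (page.length < pageSize) break; // last page reached
  }
  collected.sort((a, b) => a.postedAt.localeCompare(b.postedAt));
  return collected.map((c) => `[${c.postedAt}]\n${c.body}`).join("\n\n---\n\n");
}

// Example use from the browser console: copy everything to the clipboard.
// fetchAllComments("someUserId").then((text) => navigator.clipboard.writeText(text));
```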
These things can be done well with a browser extension, and this way they can work on other sites too. I recently made some browser extensions and it’s really easy.
That said, I’d rather people didn’t do this. The walls between human beings are already pretty thick. It’s alienating enough to talk with someone by text on the internet, with all the chances of misunderstanding that implies; it’ll be even harder if I know the other person is responding to an AI-rewritten version of my comment, referring to an AI-summarized version of my profile, running AI hypotheticals on how I would react, and so on. The human would be almost invisible behind the layers of colored glass.
I think all of these are better than the likely alternatives though, which are that
I fail to understand someone’s comment or the reasoning/motivations behind their words, and most likely just move on (instead of asking them to clarify)
I have little idea what their background knowledge/beliefs are when replying to them
I fail to consider some people’s perspectives on some issue
It also seems like I change my mind (or at least become somewhat more sympathetic) more easily when arguing with an AI-representation of someone’s perspective, maybe due to less perceived incentive to prove that I was right all along.
This seems like one-shot reasoning though. If you extend it to more people, the end result is a world where everyone treats understanding people as a chore to be outsourced to AI. To me this is somewhere I don’t want to go; I think a large part of my values are chores that I don’t want to outsource. (And in fact this attitude of mine began quite a few steps before AI, somewhere around smartphones.)
Hmm, I find it hard to understand or appreciate this attitude. I can’t think of any chores that I intrinsically don’t want to outsource, only concerns that I may not be able to trust the results. What are some other examples of chores you do and don’t want to outsource? Do you have any pattern or explanation of where you draw the line? Do you think people who don’t mind outsourcing all their chores are wrong in some way?
There’s no “line” per se. The intuition goes something like this. If my value system is only about receiving stuff from the universe, then the logical endpoint is a kind of blob that just receives stuff and doesn’t even need a brain. But if my value system is about doing stuff myself, then the logical endpoint is Leonardo da Vinci. To me that’s obviously better. So there are quite a lot of skills—like doing math, playing musical instruments, navigating without a map, or understanding people as in your example—that I want to do myself even if there are machines that could do it for me cheaper and better.
Unless one of the things you want to receive from the universe is to be like Leonardo da Vinci, or be able to do everything effortlessly and with extreme competence. Why “do chores” now if you can get to that endpoint either way, or maybe even more likely if you don’t “do chores” because it allows you to save on opportunity costs and better deploy your comparative advantage? (I can understand if you enjoy the time spent doing these activities, but by calling them “chores” you seem to be implying that you don’t?)
Well, there’s no point in asking the AI to make me good at things if I’m the kind of person who will just keep asking the AI to do more things for me! That path just leads to the consumer blob again. The only alternative is if I like doing things myself, and in that case why not start now. After all, Leonardo himself wasn’t motivated by the wish to become a polymath, he just liked doing things and did them. Even when they’re a bit difficult (“chores”).
Anyway that was the theoretical argument, but the practical argument is that it’s not what’s being offered now. We started talking about outsourcing the task of understanding people to AI, right? That doesn’t seem like a step toward Leonardo to me! It would make me stop using a pretty important part of my mind. Moreover, it’s being offered by corporations that would love to make me dependent, and that have a bit of history getting people addicted to stuff.
But I’m only asking the AI to do things for me because they’re too effortful or costly. If the AI made me good at these things with no extra effort or cost (versus asking the AI to do it) then why wouldn’t I do them myself? For example I’m pretty sure I’d love the experience of playing like a concert pianist, and would ask for this ability, if doing so involved minimal effort and cost.
On the practical side, I agree that atrophy and being addicted/exploited are risks/costs worth keeping in mind, but I’ve generally made tradeoffs more in the direction of using shortcuts to minimize “doing chores” (e.g., buying a GPS for my car as soon as they came out, giving up learning an instrument very early) and haven’t regretted it so far.
(This thread is getting a bit long, and we might not be convincing each other very much, so hope it’s ok if I only reply with points I consider interesting—not just push-pull.)
With the concert pianist thing I think there’s a bit of type error going on. The important skill for a musician isn’t having fast fingers, it’s having something to say. Same as: “I’d like to be able to write like a professional writer”—does that mean anything? You either have things you want to write in the way that you want to write, or there’s no point being a writer at all, much less asking an AI to make you one. With music or painting it’s the same. There’s some amount of technique required, but you need to have something to say, otherwise there’s no point doing it.
So with that in mind, maybe music isn’t the best example in your case. Let’s take an area where you have something to say, like philosophy. Would you be willing to outsource that?
Outsourcing philosophy is the main thing I’ve been trying to do, or trying to figure out how to safely do, for decades at this point. I’ve written about it in various places, including this post and my pinned tweet on X. Quoting from the latter:
Aside from wanting to outsource philosophy to ASI, I’d also love to have more humans who could answer these questions for me. I think about this a fair bit and wrote some things down but don’t have any magic bullets.
(I currently think the best bet to eventually getting what I want is to encourage an AI pause along with genetic enhancements for human intelligence, have the enhanced humans solve metaphilosophy and other aspects of AI safety, then outsource the rest of philosophy to ASI, or have the enhanced humans decide what to do at that point.)
BTW I thought this would be a good test for how competent current AIs are at understanding someone’s perspective so I asked a bunch of them how Wei Dai would answer your question, and all of them got it wrong on the first try, except Claude Sonnet 4.5 which got it right on the first try but wrong on the second try. It seems like having my public content in their training data isn’t enough, and finding relevant info from the web and understanding nuance are still challenging for them. (GPT-5 essentially said I’d answer no because I wouldn’t trust current AIs enough, which is really missing the point despite having this whole thread as context.)
Yeah, I wouldn’t have predicted this response either. Maybe it’s a case of something we talked about long ago—that if a person’s “true values” are partly defined by how the person themselves would choose to extrapolate them, then different people can end up on very diverging trajectories. Like, it seems I’m slightly more attached to some aspects of human experience that you don’t care much about, and that affects the endpoint a lot.
I would like to do more work on this kind of stuff, and expect to do so after a current big batch of back-end refactors is done (not commenting on whether we might do any of these specific AI features, but it seems clear that we will want to figure out how to integrate AI into both discussion and content production on LW somehow).
Thanks for letting me know. Is there anything on my list that you don’t think is a good idea or probably won’t implement, in which case I might start working on them myself, e.g. as a userscript? Especially #5, which is also useful for other reasons, like archiving and searching.
I think we are unlikely to do #2 based on my current guesses of what are good ideas. I think #1 is also kind of unlikely. I think some version of 3, 4, and 5 are definitely things I want to explore.
Hey, it’s been 6 months. Can I get an updated ETA on 5 please? If it’s going to take much longer, please let me know and I’ll just code up something myself.
We have a bunch of UI that I would need to modify only a tiny bit to get you #5, I think.
If you imagine the UI at lesswrong.com/autocompleteSettings, but with a “copy to clipboard” button at the bottom, and a user search menu at the top (instead of just having Gwern, Eliezer and Scott), would that work for you?
(Note that the list-plus icon button that you get when hovering over a list entry allows you to mark all elements above the item in the list you are hovering over as checked, so you don’t have to click on dozens of comments manually)
Need: A way to load all comments and posts of a user. Right now it only loads the top N by karma.
Want: A “download” button, for some users who have up to hundreds of MB of content, too unwieldy to copy/paste. Ability to collate/sort in various ways, especially as flat list of mixed posts and comments, sorted by posting date from oldest to newest.
The “load all” and “download” parts are a bit tricky. I don’t super want to give people buttons to download hundreds of megabytes of data from our database, which is at greater risk of causing our DB to fall over and delay other queries people are waiting on. Probably it’s doable, but it makes the feature a bit more complicated.
I’ll give it a quick try tomorrow or by the end of next week at the latest and see whether I can get you what you want. I’ll let you know if it turns out to be more than an hour or two of work, in which case I might delay it further.
I’m a huge fan, especially for the user-specific, ephemeral uses like you describe. “Summarize the major contrasting views to this post” would be awesome. I’m much less happy with publication and posting-support uses that would be the obvious things to do.
Are humans fundamentally good or evil? (By “evil” I mean something like “willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)” and by “good” I mean “most people won’t do that because they terminally care about others”.) People say “power corrupts”, but why isn’t “power reveals” equally or more true? Looking at some relevant history (people thinking Mao Zedong was sincerely idealistic in his youth, early Chinese Communist Party looked genuine about wanting to learn democracy and freedom from the West, subsequent massive abuses of power by Mao/CCP lasting to today), it’s hard to escape the conclusion that altruism is merely a mask that evolution made humans wear in a context-dependent way, to be discarded when opportune (e.g., when one has secured enough power that altruism is no longer very useful).
After writing the above, I was reminded of @Matthew Barnett’s AI alignment shouldn’t be conflated with AI moral achievement, which is perhaps the closest previous discussion around here. (Also related are my previous writings about “human safety” although they still used the “power corrupts” framing.) Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
At the time, I replied to him, “I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general.” I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough, that as they got richer and more powerful, their altruistic values would have more and more influence. In the few months since then, I’ve become more worried, perhaps due to learning more about Chinese history and politics...
My model is that the concept of “morality” is a fiction which has 4 generators that are real:
People have empathy, which means they intrinsically care about other people (and sufficiently person-like entities), but, mostly about those in their social vicinity. Also, different people have different strength of empathy, a minority might have virtually none.
Superrational cooperation is something that people understand intuitively to some degree. Obviously, a minority of people understand it on System 2 level as well.
There is something virtue-ethics-like which I find in my own preferences, along the lines of “some things I would prefer not to do, not because of their consequences, but because I don’t want to be the kind of person who would do that”. However, I expect different people to differ in this regard.
The cultural standards of morality, which it might be selfishly beneficial to go along with, including lying to yourself that you’re doing it for non-selfish reasons. Which, as you say, becomes irrelevant once you secure enough power. This is a sort of self-deception which people are intuitively skilled at.
The reason I said “most humans are not evil” is because I honestly don’t think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).
By contrast, I focused on consumer behavior because the majority of the world’s economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.
Focusing on consumer preferences is a natural thing to do if you want to capture intuitively “what humans are doing with their wealth”, at least in our current world. Before focusing on something else by default—such as moral preferences—I’d want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.
You mention one such argument along these lines:
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I’d like to see one before I abandon my focus on “everyday (e.g. consumer) behavior”.
Ok, I guess we just define/use it differently. I think most people we think of as “evil” probably justify inflicting harm to others as instrumental to some “greater good”, or are doing it to gain or maintain power, not because they value it for its own sake. I mean if someone killed thousands of people in order to maintain their grip on power, I think we’d call them “evil” and not just “selfish”?
I’m pretty sure that billionaires consume much less as percent of their income, compared to the average person. EA funding comes disproportionately from billionaires, AFAIK. I personally spend a lot more time/effort on altruistic causes, compared to if I was poorer. (Not donating much though for a number of reasons.)
I’m thinking that we just haven’t reached that inflection point yet, where most people run out of things to spend selfishly on (like many billionaires have, and like I have to a lesser extent). As I mentioned in my reply to your post, a large part of my view comes from not being able to imagine what people would spend selfishly on, if each person “owned” something like a significant fraction of a solar system. Why couldn’t 99% of their selfish desires be met with <1% of their resources? If you had a plausible story you could tell about this, that would probably change my mind a lot. One thing I do worry about is status symbols / positional goods. I tend to view that as a separate issue from “selfish consumption” but maybe you don’t?
I like the insight regarding power corrupting or revealing. I think perhaps both might be true and, if so, we should keep both lines of thought in mind when thinking about these types of questions.
My general view is that most people are generally good when you’re talking about individual interactions. I’m less confident in that when one brings in the in-group/out-of-group aspects. I just am not sure how to integrate all that into a general view or principle about human nature.
A line I heard in some cheesy B-grade horror movie relates to this question about personal nature and the idea that we all have competing good and bad wolves inside. One of the characters asks which wolf is stronger, the good wolf or the bad wolf. The answer was “Which do you feed the most?”
Yes, obviously.
The vast majority of people would inflict huge amounts of disutility on others if they thought they could get away with it and benefitted from it.
What then prevents humans from being more terrible to each other? If the vast majority of people are like this, and they know that the vast majority of others are also like this, up to common knowledge, I don’t see how you’d get a stable society in which people aren’t usually screwing each other over a giant amount.
Any thoughts on why, if it’s obvious, it’s seldom brought up around here (meaning rationalist/EA/AI safety circles)?
I don’t think altruism is evolutionarily connected to power as you describe. Caesar didn’t come to power by being better at altruism, but by being better at coordinating violence. For a more general example, the Greek and other myths don’t give many examples of compassion (though they give many other human values); it seems the modern form of compassion only appeared with Jesus, which is too recent for any evolutionary explanation.
So it’s possible that the little altruism and other nice things we have are merely lucky memes. Not even a necessary adaptation, but more like a cultural peacock’s tail, which appeared randomly and might fix itself or not. Meanwhile our fundamental nature remains that of other living creatures, who eat each other without caring much.
I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
Yeah, that seems to agree with my pessimistic view—that we are selfish animals, except we have culture, and some cultures accidentally contain altruism. So the answer to your question “are humans fundamentally good or evil?” is “humans are fundamentally evil, and only accidentally sometimes good”.
I think that there is a version of an altruistic pursuit where one will, by default, “reduce his power.” I think this scenario happens when, in the process of attempting to do good, one exposes himself more to unintended consequences. The person who sacrifices will reduce his ability to exercise power, but he may regain or supersede such loss if the tribe agrees with his rationale for such sacrifice.
Just because it was not among the organizing principles of any of the literate societies before Jesus does not mean it is not part of the human mental architecture.
There are several levels in which humans can be bad or evil:
Doing bad things because they believe them to be good
Doing bad things while not caring whether they are bad or not
Doing bad things because they believe them to be bad (Kant calls this “devilish”)
I guess when humans are bad, they usually do 1). Even Hitler may have genuinely thought he was doing the morally right thing.
Humans also sometimes do 2), for minor things. But rarely if the anticipated bad consequences are substantial. People who consistently act according to 2) are called psychopaths. They have no inherent empathy for other people. Most humans are not psychopathic.
Humans don’t do 3), they don’t act evil for the sake of it. They aren’t devils.
I have a feeling that for many posts that could be posted as either normal posts or as shortform, they would get more karma as shortform, for a few possible reasons:
lower quality bar for upvoting
shortforms showing some of the content, which helps hook people in to click on it
people being more likely to click on or read shortforms due to less perceived effort of reading (since they’re often shorter and less formal)
This seems bad because shortforms don’t allow tagging and are harder to find in other ways. (People are already more reluctant to make regular posts due to more perceived risk if the post isn’t well received, and the above makes it worse.) Assuming I’m right and the site admins don’t endorse this situation, maybe they should reintroduce the old posting karma bonus multiplier, but like 2x instead of 10x, and only for positive karma? Or do something else to address the situation like make the normal posts more prominent or enticing to click on? Perhaps show a few lines of the content and/or display the reading time (so there’s no attention penalty for posting a literally short post as a normal post)?
Some months ago, I suggested that there could be a UI feature to automatically turn shortforms into proper posts if they get sufficient karma, which authors could turn on or off.
One potential issue is that this makes posting shortforms even more attractive, so you might see everything being initially posted as shortforms (except maybe very long effortposts) since there’s no downside to doing that. I wonder if that’s something the admins want to see.
I agree that this seems like a likely effect.
It seems like the quality of shortform writing that displaces what would otherwise have been full posts will generally be lower. But on the other hand, people might feel more willing to publish at all, because they don’t have to make the assessment of whether what they’ve written is good enough to be worth making a bid for other people to read it.
I want a lesswrong canny—here’s vrchat’s canny for comparison. Canny (or similar systems, eg “the feature requests post”) are nice because upvoted features have no particular need of being implemented. It also means that the “why users want this” feedback channel is higher bandwidth than just guess-test-and-ask; with a central place for people to comment on features and indicate their preferences, it’s much easier for someone to go see what feature suggestions there are and add their feedback. I have several feature requests I’d add which the LW team has been hesitant about, and I think that if they made affordance for many users to comment on potential features, it would become clearer which ones are actually wanted by many people and why; and yet it would not force the LW team to implement any one feature, so requests that they’d prefer to reject or satisfy a different way would be still possible.
So, I formally request that the LW team make a feature requests post and pin it somewhere discoverable but not overly prominent, eg the top right user menu.
A persistent place to look for feedback which creates common knowledge of desired features seems likely to me (70% ish?) to make it obvious that shortform-to-post is one of the top 4 most desired features out of dozens.
And because you can read them without loading a new page. I think that’s a big factor for me.
[Tangent:]
There is a sort of upside to this, in that to the extent that people are more inclined to post shortforms than longforms due to the lower perceived/expected effort of the former, there is a possibility of (optional?) UX engineering to make writing longforms feel a bit more like writing shortforms, so that people who have something to write but also have a feeling of “ugh, that would be a lot of effort, I’ll do it when I’m not as tired [or whatever]” would be more inclined to write and post it.
Relatedly, every few days, I find myself writing some long and detailed message in a DM, which I would be less motivated to write in my personal notes, let alone write a blog post about it, and sometimes the message turns out to look like a first draft of a blog post.[1] How to hijack this with UX?[2]
After I started talking about it, I found out that “write an article like a message to an intellectual-peer friend” is apparently something like folk advice.
Of course, also: How to hijack this with stuff other than UX?
I think posts should be displayed more like shortforms so that one isn’t limited to a title to make the argument for reading more. Ideally, it would come with a recommendation for how to put “who should read this post, and when? what do they get out of it?” info at the top of the post, within blurb length limit.
Math and science as original sins.
From Some Thoughts on Metaphilosophy:
It occurred to me that from the perspective of longtermist differential intellectual progress, it was a bad idea to invent things like logic, mathematical proofs, and scientific methodologies, because it permanently accelerated the wrong things (scientific and technological progress) while giving philosophy only a temporary boost (by empowering the groups that invented those things, which had better than average philosophical competence, to spread their culture/influence). Now we face the rise of China and/or AIs, both of which seem likely (or at least plausibly) to be technologically and scientifically (but not philosophically) competent, perhaps in part as a result of technological/scientific (but not philosophical) competence having been made legible/copyable by earlier philosophers.
If only they’d solved metaphilosophy first, or kept their philosophy of math/science advances secret! (This is of course not entirely serious, in case that’s not clear.)
I’d suggest that trying to understand what values are would potentially have been a better direction to emphasize. Our understanding here is still pre-Socratic, basically pre-cultural.
It seems to me that values have been a main focus of philosophy for a long time, with moral philosophy (or perhaps meta-ethics if the topic is “what values are”) devoted to it and discussed frequently both in academia and out, whereas metaphilosophy has received much less attention. This implies that we know progress on understanding values is probably pretty hard on the current margins, whereas there’s a lot more uncertainty about the difficulty of metaphilosophy. Solving the latter would also be of greater utility, since it makes solving all other philosophical problems easier, not just values. I’m curious about the rationale behind your suggestion.
Specifically the question of “what values are” I don’t think has been addressed (I’ve looked around some, but certainly not thoroughly). A key problem with previous philosophy is that values are extreme in how much they require some sort of mental context (https://www.lesswrong.com/posts/HJ4EHPG5qPbbbk5nK/gemini-modeling). Previous philosophy (that I’m aware of) largely takes the mental context for granted, or only highlights the parts of it that are called into question, or briefly touches on it. This is pretty reasonable if you’re a human talking to humans, because you do probably share most of that mental context. But it fails on two sorts of cases:
trying to think about or grow/construct/shape really alien minds, like AIs;
trying to exert human values in a way that is good but unnatural (think for example of governments, teams, “superhuman devotion to a personal mission”, etc.).
The latter (2) might, given more progress, have helped us become wiser.
My comment was responding to
So I’m saying, in retrospect on the 2.5 millennia of philosophy, it plausibly would have been better to have an “organology, physiology, medicine, and medical enhancement” of values. To say it a different way, we should have been building the conceptual and introspective foundations that would have provided the tools with which we might have been able to become much wiser than is accessible to the lone investigators who periodically arise, try to hack their way a small ways up the mountain, and then die, leaving mostly only superficial transmissions.
I would agree pretty strongly with some version of “metaphilosophy is potentially a very underserved investment opportunity”, though we don’t necessarily agree (because of having “very different tastes” about what metaphilosophy should be, amounting to not even talking about the same thing). I have ranted several times to friends about how philosophy (by which I mean metaphilosophy—under one description, something like “recursive communal yak-shaving aimed at the (human-)canonical”) has barely ever been tried, etc.
I know @Wei Dai’s post isn’t entirely serious, but I want to flag that the position that we could have understood values/philosophy without knowing about math/logic is a fictional world/fabricated option.
It cannot exist, and updatelessness can be taken too far with compute constraints.
Maybe but I don’t believe that you know this. Lots of important concepts want to be gotten at by routes that don’t use much math or use quite different math from “math to understand computers” or “math to formalize epistemology”. Darwin didn’t need much math to get lots of the core structure of evolution by natural selection on random mutation.
Perhaps more seriously, the philosophers who got a temporary manpower and influence boost from the invention of math and science should have worked much harder to solve metaphilosophy, while they had the advantage.
It’s quite possible that we have solved metaphilosophy, in the direction of philosophy being insoluble.
This contradicts my position in Some Thoughts on Metaphilosophy. What about that post do you find unconvincing, or what is your own argument for “philosophy being insoluble”?
Talk about “values” is very popular on LessWrong, but much less common in philosophy or natural language. I confess I don’t even know what you mean with “trying to understand what values are”. Can you make the problem statement more precise, perhaps without reference to “values”?
Old High rationalism had a worldview in which rationality, AI, and ethics were all roughly the same thing: the optimisation of utility/value (which are the same thing). Rationality was von Neumann rationality; an AI was a utility-maximising agent; the only possible theory of ethics was utilitarianism, i.e. utility maximisation, i.e. value maximisation.
No, that’s part of the problem. There’s pretheoretic material as something of a starting point here:
https://www.lesswrong.com/posts/YLRPhvgN4uZ6LCLxw/human-wanting
Whatever those things are, you’d want to understand the context that makes them what they are:
https://www.lesswrong.com/posts/HJ4EHPG5qPbbbk5nK/gemini-modeling
And refactor the big blob into lots of better concepts, which would probably require a larger investigation and conceptual refactoring:
https://www.lesswrong.com/posts/TNQKFoWhAkLCB4Kt7/a-hermeneutic-net-for-agency
In particular so that we understand how “values” can be stable (https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=koeti9ygXB9wPLnnF) and can incorporate novel concepts / deal with novel domains (https://www.lesswrong.com/posts/CBHpzpzJy98idiSGs/do-humans-derive-values-from-fictitious-imputed-coherence) and eventually address the stuff here: https://www.lesswrong.com/posts/ASZco85chGouu2LKk/the-fraught-voyage-of-aligned-novelty
Philosophers have discussed these under the term “desires”. I think there has been a lot of progress since the time of the pre-Socratics: Aristotle’s practical syllogism, Buridan’s donkey, Hume’s emphasis on the independence of beliefs and desires, Kant’s distinction between theoretical reason and practical reason, direction of fit, Richard Jeffrey’s utility theory (where utilities are degrees of desire), analyses of akrasia by various analytic philosophers, Nozick’s experience machine, and various others.
“A lot of progress”… well, reality doesn’t grade on a curve. Surely someone has said something about something, yes, but have we said enough about what matters? Not even close. If you don’t know how inadequate our understanding of values is, I can’t convince you in a comment, but one way to find out would be to try to solve alignment. E.g. see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html
There is quite the difference between “our understanding is still pre-Socratic” and “we haven’t said enough”. In general I think very few people here (not sure whether this applies to you) are familiar with the philosophical literature on topics in this area. For example, there is very little interest on LessWrong in normative ethics and the associated philosophical research, even though this is directly related to alignment, since, if you have an intent-aligned ASI (which is probably easier to achieve than shooting straight for value alignment), you probably need to know what ethics it should implement when asking it to create a fully value-aligned ASI.
Interestingly, the situation is quite different for the EA Forum, where there are regular high-quality posts on solving issues in normative ethics with reference to the academic literature, like the repugnant conclusion, the procreation asymmetry and the status of person-affecting theories. Any satisfactory normative ethical theory needs to solve these problems, similar to how any satisfactory normative theory of epistemic rationality needs to solve the various epistemic paradoxes and related issues.
Again, I don’t know whether this applies to you, but most cases of “philosophy has made basically no progress on topic X” seem to come from people who have very little knowledge of the philosophical literature on topic X.
I’m not sure. I did put in some effort to survey various strands of philosophy related to axiology, but not much effort. E.g. looked at some writings in the vein of Anscombe’s study of intention; tried to read D+G because maybe “machines” is the sort of thing I’m asking about (was not useful to me lol); have read some Heidegger; some Nietzsche; some more obscure things like “Care Crosses the River” by Blumenberg; the basics of the “analytical” stuff LWers know (including doing some of my own research on decision theory); etc etc. But in short, no, none of it even addresses the question—and the failure is the sort of failure that was supposed to have its coarsest outlines brought to light by genuinely Socratic questioning, which is why I call it “pre-Socratic”, not to say that “no one since Socrates has billed themselves as talking about something related to values or something”.
I think even communicating the question would take a lot of work, which as I said is part of the problem. A couple hints:
https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects (I think if you read this it will seem incredibly boringly obvious and trivial, and yet, literally no one addresses it! Some people sort of try, but fail so badly that it can’t count as progress. Closest would be some bits of theology, maybe? Not sure.)
https://www.lesswrong.com/posts/p7mMJvwDbuvo4K7NE/telopheme-telophore-and-telotect (I think this distinction is mostly a failed attempt to carve things, but the question that it fails to answer is related to the important question of values.)
You should think of the question of values as being more like “what is the driving engine” rather than “what are the rules” or “what are the outcomes” or “how to make decisions” etc.
Prestige status is surprisingly useless in domestic life, and dominance status is surprisingly often held by the female side, even in traditional “patriarchal” societies.
Examples: Hu Shih, the foremost intellectual of 1920s China (Columbia PhD, professor of Peking University and later its president), being “afraid” of his illiterate, foot-bound wife and generally deferring to her. Robin Hanson’s wife vetoing his decision to sell stocks ahead of COVID, and generally not trusting him to trade on their shared assets.
Not really sure why or how to think about this, but thought I’d write down this observation… well a couple of thoughts:
Granting or recognizing someone’s prestige may be a highly strategic (albeit often subconscious) decision, not something you just do automatically.
These men could probably win more dominance status in their marriages if they tried hard, but perhaps decided their time and effort was better spent to gain prestige outside. (Reminds me of comparative advantage in international trade, except in this case you can’t actually trade the prestige for dominance.)
The Robin Hanson example doesn’t show that dominance is held by his wife, Peggy Jackson, unless you have tweets from her saying that she decided to trade a lot of stocks, he tried to veto it, and she overruled his veto and did it anyway. They could have a rule where large shared investment decisions are made with the consent of both sides. Some possibilities:
You’re surprised by the absence of male dominance, not the presence of female dominance.
You interpreted a partner-veto as partner-dominance, instead of joint decision-making.
Peggy Jackson is dominant in their relationship but you picked a less compelling example.
This from the same tweet reads as Robin Hanson getting his way in a dispute:
You seem to have left out the fact that Robin Hanson is a renowned economics expert and likely has more skill in deciding when to sell stocks than his spouse.
I think modeling the social dynamics of two people in a marriage with status, a high-level abstraction typically applied to groups, doesn’t make much sense. Game theory would make more sense imo.
@Lukas Finnveden I originally heard it in a Chinese language YouTube video about Hu Shih. I tried to look up a reference before making the post, but couldn’t find anything online that was well-sourced so decided not to cite anything, but this is the best that I can find, with English translation below:
AI translated article about Hu Shih’s marriage
The Marriage of Hu Shih and Jiang Dongxiu: One of the Great Oddities of the Republican Era
“You want a divorce? Fine. Since you don’t want us anymore, I’ll kill our two sons first, then kill myself right in front of you!” Jiang Dongxiu said, full of grief and indignation.
Hu Shih looked in horror at the kitchen knife held to the neck of his 5-year-old son, Liwang, and at his 2-year-old son, Sidu, in his wife’s arms. Terrified, he fled. All thoughts of divorce could go to hell.
The woman holding the kitchen knife was Jiang Dongxiu, the arranged-marriage wife of Hu Shih, a leader of the New Culture Movement and a great literary master. Their marriage was known as one of the “Great Oddities of the Republican Era.”
The pairing of a cultural elite with an uneducated village woman was not surprising in itself; Lu Xun, Xu Zhimo, and Yu Dafu all abandoned their first wives. What seemed strange was that Hu Shih never abandoned Jiang Dongxiu.
Was Jiang Dongxiu Hu Shih’s true love? No. Hu Shih agreed to marry Jiang Dongxiu entirely to please his mother. Thus, the thirteen-year-old boy and the fourteen-year-old girl were engaged. Afterwards, Hu Shih began his thirteen-year journey of study in Shanghai and the United States, while the yet-to-be-wed Jiang Dongxiu moved into the Hu family home to care for his mother.
During his studies in America, Hu Shih met the beautiful and intelligent American woman, Edith Clifford Williams.
[“Although in the past forty-eight hours, I have already written you two letters and a postcard, I can’t resist writing another letter...”]
From excerpts of their letters, one can see the fiery passion in their relationship. During this same period, Hu Shih’s views on women’s education also changed:
The purpose of women’s education is not to prepare them to be good wives and mothers, but to cultivate free and independent women.
This shift in thinking guided his actions. Across the ocean, Jiang Dongxiu received a letter from Hu Shih asking to break off the engagement.
Hu Shih’s formidable mother said bluntly, “This is absolutely impossible, get that idea out of your head!” At the same time, she wrote a letter to Edith’s mother, informing her of the fact that Hu Shih already had a fiancée. In the end, this transnational romance was crushed by the iron will of the two matriarchs.
In 1917, the 27-year-old Hu Shih and the 28-year-old Jiang Dongxiu were finally married. Who could have known that behind this seemingly happy ending was the beginning of a nightmare.
At the wedding, the person who caught Hu Shih’s eye was not his bride, but the bridesmaid, Cao Chengying, who called him “Brother Mi.”
Cao Chengying loved flowers and literature. In her letters to Hu Shih’s wife, “Sister-in-law Dongxiu,” she would enclose poems for “Brother Mi” to appreciate, which Hu Shih was always delighted to do.
The affair between them began during a visit. When Hu Shih was recuperating from an illness in Hangzhou, Cao Chengying, who worked there, went to visit him. The two toured Hangzhou together for four days. As they parted, he gave her a short vernacular poem titled “West Lake.”
The West Lake I dreamed of for seventeen years,
Cannot cure my illness,
But instead makes it worse
...
Now that I’ve returned,
I only find it more lovely,
And thus cannot bear to leave so soon.
A few days later, Hu Shih secretly traveled from Shanghai to the shores of West Lake in Hangzhou, rented a few rooms in a temple, and lived with Cao Chengying for three months. Afterwards, he went home and formally asked his wife for a divorce, to which Jiang Dongxiu gave her forceful response with a kitchen knife.
From then on, Hu Shih never mentioned divorce again and became the model of a henpecked husband in literary circles. He even wrote the “New Three Obediences and Four Virtues Song”:
[The Three Obediences: One must obey when the wife gives an order; one must follow along when the wife goes shopping or plays mahjong; one must blindly obey when the wife loses her temper for the wrong reason. The Four Virtues: One must be willing to spend when the wife buys things; one must be able to endure when the wife gets angry; one must remember the wife’s birthday; one must be able to wait when the wife is dressing up to go out.]
As for Cao Chengying, she could only grieve in sorrow and had to abort the child she was carrying.
Can a relationship last long based only on threats and intimidation? A kitchen knife can kill, but it can also express tenderness.
Jiang Dongxiu could use a kitchen knife to defend her marriage, but she could also cook Hu Shih’s favorite Anhui dish, “Yipin Guo.” She provided meticulous care in their daily life, an all-powerful homemaker who never let Hu Shih worry about a thing. In her interactions with her husband, she would even speak straightforward words of love:
[“Mr. Gao said you can’t live in Shanghai anymore, he said your health this time is not as good as before. Today I heard him say you are not very well, and my heart felt like it was being cut by a knife. No matter what, I beg you, as soon as you see my letter, to hurry back to Beijing. I have been sick for three days...”]
This was a letter Hu Shih received from his wife while he was ill. Although it was filled with misspelled words, it delighted him, and he even wrote a poem in response:
Sick in bed, I get a letter from my wife,
Not even eight full lines of paper;
Nothing of importance in it,
Yet it makes me quite happy.
In this back-and-forth, sweetness was found between the lines. The village woman with bound feet, Jiang Dongxiu, used both hard and soft tactics to manage the great literary master Hu Shih, securing her own position and growing old together with him.
Some quotes from it:
A few days later, Hu Shih secretly traveled from Shanghai to the shores of West Lake in Hangzhou, rented a few rooms in a temple, and lived with Cao Chengying for three months. Afterwards, he went home and formally asked his wife for a divorce, to which Jiang Dongxiu gave her forceful response with a kitchen knife.
From then on, Hu Shih never mentioned divorce again and became the model of a henpecked husband in literary circles. He even wrote the “New Three Obediences and Four Virtues Song”:
[The Three Obediences: One must obey when the wife gives an order; one must follow along when the wife goes shopping or plays mahjong; one must blindly obey when the wife loses her temper for the wrong reason. The Four Virtues: One must be willing to spend when the wife buys things; one must be able to endure when the wife gets angry; one must remember the wife’s birthday; one must be able to wait when the wife is dressing up to go out.]
Thanks!
Intuitively, I would expect any hard-coded psychological meta-rule that allows a wife to prevent her husband from day trading significant fractions of their wealth based on facts and logic to be a massive net positive to reproductive fitness over the past 3000 years. It clearly didn’t work this time, but that doesn’t mean it was a bad idea over a population.
It is easier to impress people who know you less, because you can choose what to show them, and they don’t see the rest.
For example, a painter can show the 20 best paintings he made during the last 10 years at an exhibition. People are deeply impressed. His wife also knows about the hundreds of paintings he destroyed because he thought they were ugly, and about the months when he didn’t paint anything and just sat depressed at home drinking alcohol. His wife is much less impressed. She would appreciate more help at home and with the kids; also, the money he brings in is negligible, and he spends most of it on alcohol anyway.
This is a fictional example, but the idea is that the public sees your best, while the people around you see your average.
Who knows, maybe Robin Hanson shares his contrarian ideas with his wife first, she convinces him to abandon the most absurd ones, and he publishes the rest. Maybe after hearing all the crazy ideas we didn’t hear, it made perfect sense to distrust his ideas about selling stocks. -- I am just imagining all this; I have no evidence for that.
What about some examples from your real life? Asking because we don’t really know many details behind the 2 given examples.
Crossposting from X:
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it’s possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading to a collapse of civilization, but embryo selection for IQ could prevent that, and even if collapse happens, natural selection would start increasing fertility and intelligence of humans again, so future smarter humans should be able to rebuild civilization and restart technological progress.
Added: Here’s an example to illustrate my model. Assume a normally distributed population with an average IQ of 100, and that we need a certain number of people with IQ>130 to achieve AGI. If the total population were to halve, then to get the same absolute number of IQ>130 people as today, average IQ would have to increase by 4.5, and if the population were to become 1/10 of the original, average IQ would have to increase by 18.75.
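A minimal sketch of this calculation (assuming IQ ~ Normal(100, 15), a standard assumption not stated above): if the population shrinks by a factor k, the fraction above the threshold must grow by the same factor to keep the absolute number of IQ>130 people constant. The figures it prints roughly match the ones above.

```python
# Rough check of the numbers above, assuming IQ ~ Normal(100, 15).
from scipy.stats import norm

def required_mean_increase(shrink_factor, threshold=130, mean=100, sd=15):
    p_now = norm.sf(threshold, loc=mean, scale=sd)   # current fraction above threshold (~2.3%)
    p_needed = p_now * shrink_factor                 # fraction needed after the population shrinks
    new_mean = threshold - sd * norm.isf(p_needed)   # mean that puts that fraction above the threshold
    return new_mean - mean

print(required_mean_increase(2))   # population halved: roughly +4.6 IQ points
print(required_mean_increase(10))  # population at 1/10: roughly +18.8 IQ points
```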
I find the idea interesting: To achieve a certain value of “total genius”, we either need a large population with a small fraction of geniuses, or a small population with a large fraction of geniuses.
(A third option is a small population with a small fraction of geniuses… and it takes a lot of time. The geniuses read each other’s books, rather than talk to each other directly. I think it was like this in the past. Very inefficient, because the information transfer by reading books is one-sided; does not allow collaboration in real time.)
I wonder how the heritability of IQ works, versus the reversion to the mean. Despite Pol Pot’s dystopian project, the average IQ in Cambodia seems to be… average. What would happen in a country where, let’s say, half of the children are produced by artificial insemination, and half of the sperm comes from fathers with IQ 130 and above? If the mother is average, the child is likely to be around the average of 100 and 130, so 115. On one hand, nothing exceptional; on the other hand, if the baseline is now slightly higher, then the next generation… and here comes the question of how exactly the reversion to the mean works, and whether constant injections of IQ-130 genes into the population could outrun it.
Reversion to the mean happens because the phenotype is an imperfect proxy for the genotype, so if you select a person with phenotypic IQ of 130, maybe their genotypic IQ is only 124, and therefore their offspring with an IQ 100 person would only be IQ 112 in expectation.
However, this expectation applies to both the offspring’s genotypic and phenotypic IQ, so this is the only regression to the mean you’re going to see; it’s not going to regress further down the line.
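A minimal sketch of this arithmetic, with an assumed regression factor of 0.8 (chosen only to reproduce the 130 → 124 step above) and a simple additive model in which the expected offspring value is the midparent value:

```python
# Illustrative only: the 0.8 factor and the additive midparent model are assumptions,
# picked to match the 130 -> ~124 -> ~112 numbers in the comment above.
POP_MEAN = 100
REGRESSION_FACTOR = 0.8  # how much of a phenotypic deviation is expected to be genotypic

def expected_genotype(phenotypic_iq):
    # the phenotype regresses toward the mean when inferring the genotype
    return POP_MEAN + REGRESSION_FACTOR * (phenotypic_iq - POP_MEAN)

def expected_offspring(genotype_a, genotype_b):
    # under the additive model, the expectation is the midparent value, and it
    # applies to both the offspring's genotype and phenotype (no further regression)
    return (genotype_a + genotype_b) / 2

g = expected_genotype(130)            # 124.0
print(g, expected_offspring(g, 100))  # 124.0 112.0
```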
I was curious what Habryka meant when he said this. Don’t non-profits usually have some kind of board oversight? It turns out (from documents filed with the State of California) that Lightcone Infrastructure, which operates LW, is what’s known as a sole-member nonprofit, with a 1-3 person board of directors determined by a single person (member), namely Oliver Habryka. (Edit: My intended meaning here is that this isn’t just a historical fact, but Habryka still has this unilateral power. And after some debate in the comments, it looks like this is correct after all, but was unintentional. See Habryka’s clarification.)
However, it also looks like the LW domain is owned by MIRI, and MIRI holds the content license (legally the copyright is owned by each contributor and licensed to MIRI for use on LW). So if there was a big enough dispute, MIRI could conceivably find another team to run LW.
I’m not sure who owns the current code for LW, but I would guess it’s Lightcone, so MIRI would have to also recreate a codebase for it (or license GreaterWrong’s, I guess).
I was initially confused why Lightcone was set up that way (i.e., why was LW handed over to an organization controlled by a single person), but the structure probably makes it more nimble and the risk of Lightcone “going rogue” is mitigated to a large extent by MIRI retaining the option to swap out the team.
Anyway it took me a while to figure all this out, and I thought I’d share it so others would be informed while participating on LW.
The community voted on it back in the day. MIRI wanted to make LessWrong a read-only archive. I said I would try to revive it. At the end of the period high-karma users voted whether to continue under my management, or archive the site. The former happened!
(Also, we do have a 3 person board)
Ah thanks, I remember a bit more now. Looking back at the voting announcement posted by Vaniver, it didn’t mention how important your role would be on LW 2.0:
And I didn’t pay much attention to the LW 2.0 / Lightcone organizational structure in the following years, so it came as kind of a surprise when you said “This is (approximately) my forum.”
Also, just to contextualize this, this is in the context of a thread about forum moderation. I have various complicated takes about the degree to which LW belongs to Lightcone Infrastructure, and what our relationship to various stakeholders is, and I don’t relate to LessWrong as a thing I (approximately) own in most respects.
If you are uncertain about what I would feel comfortable doing, and what I wouldn’t, feel free to ask me!
Do you mean via DM, or like via a quick take that Wei Dai has written here? Or some secret third thing?
Any of those seem fine. Public is better, since more people get to benefit from it.
This is true, but to be clear, we do have a 3 person board! It’s me, Vaniver and Daniel Kokotajlo! I appointed Vaniver and Daniel, but we do now have a pretty bog-standard 3 person board, so things are not solely controlled by me.
According to the bylaws I linked, you (as the sole member of Lightcone) have “the exclusive right to remove a director, with or without cause”. Since the bylaws also allow Lightcone to have as few as 1 director, my understanding is that at any time, you could choose to invoke the option of removing the other directors and become the sole director. (I’m not familiar with the nonprofit world, and don’t know how common or standard this is, but it seems fair to describe this as an organization controlled by one individual.)
No, you are misreading the bylaws. The bylaws explicitly state:
I don’t really understand why corporate bylaws are structured this way, but at the present moment in time, Lightcone Infrastructure has 3 members, which together can vote on removing members and appointing members (and every director is automatically a member, and loses their membership status the moment they stop being a director).
It looks like I agreed with you too quickly. Just double-checked with Gemini Pro 3.0, and its answer looks correct to me:
This is a fascinating turn of events. Oliver is quoting from Section 3.01 of the bylaws, but he appears to be missing the critical conditional clause that precedes the text he quoted.
If you look at the bottom of Page 11 leading into Page 12 of the PDF, the sentence structure reveals that the “Directors = Members” rule is a fail-safe mechanism that only triggers if the initial member (Oliver) dies or becomes incapacitated without naming a successor.
Here is the text from the document:
By omitting the “Upon the death, resignation, or incapacity...” part, he is interpreting the emergency succession plan as the current operating rule.
Oh, huh, maybe you are right? If so, I myself was unaware of this! I will double check our bylaws and elections that have happened so far and confirm the current state of things. I was definitely acting under the assumption that I wasn’t able to fire Vaniver and Daniel and that they would be able to fire me!
See for example this guidance document I sent to Daniel and Vaniver when I asked them to be board members:
If it is indeed true that they cannot fire me, then I should really rectify that! If so, I am genuinely very grateful for you noticing.
I think, given the clear statements I have made that I was appointing them to a position in which they are able to fire me, they would probably have indeed held the formal power to do so, but it is possible that we didn’t follow the right corporate formalities, and if so we should fix that! Corporate formalities do often turn out to really matter in the end.
From what I (and Gemini) can tell, your screenshot said nothing technically untrue. Technically they can fire you as the CEO, but you’d still be the sole member and could fire them and then hire yourself back. :)
The simplest way to rectify the situation to match your intent is to either (1) Resign as member without naming a successor, then the passage you quoted would come into effect and make the directors into members or (2) Name the 3 current directors as Successor Members and then resign as member. You’ll probably want to consult a lawyer or advisor for the pros and cons of each option.
Would be grateful for an update once you’ve done this, or perhaps verified that the situation is actually different (e.g. you already resigned as member but forgot).
I think I should just be able to appoint the two as members without resigning as member? Like, members can vote to modify the bylaws, so at the very least I should just be able to appoint a new member by rewriting the bylaw.
I will look into this sometime this week or next week. Feel free to poke me for an update any time if I end up not putting one here.
Ah, I had indeed missed that part. A couple of AIs I asked also missed it, and together with the quoted statement from you, made me pretty sure my interpretation was correct. Sorry, and thanks for the quick correction. I’ve edited my post, and hope it didn’t mislead too many people.
Some potential risks stemming from trying to increase philosophical competence of humans and AIs, or doing metaphilosophy research. (1 and 2 seem almost too obvious to write down, but I think I should probably write them down anyway.)
Philosophical competence is dual use, like much else in AI safety. It may for example allow a misaligned AI to make better decisions (by developing a better decision theory), and thereby take more power in this universe or cause greater harm in the multiverse.
Some researchers/proponents may be overconfident, and cause flawed metaphilosophical solutions to be deployed or spread, which in turn derail our civilization’s overall philosophical progress.
Increased philosophical competence may cause many humans and AIs to realize that various socially useful beliefs have weak philosophical justifications (such as all humans are created equal or have equal moral worth or have natural inalienable rights, moral codes based on theism, etc.). In many cases the only justifiable philosophical positions in the short to medium run may be states of high uncertainty and confusion, and it seems unpredictable what effects will come from many people adopting such positions.
Maybe the nature of philosophy is very different from my current guesses, such that greater philosophical competence or orientation is harmful even in aligned humans/AIs and even in the long run. For example maybe philosophical reflection, even if done right, causes a kind of value drift, and by the time you’ve clearly figured that out, it’s too late because you’ve become a different person with different values.
This is pretty related to 2-4, especially 3 and 4, but also: you can induce ontological crises in yourself, and this can be pretty fraught. Two subclasses:
You now think of the world in a fundamentally different way. Example: before, you thought of “one real world”; now you think in terms of Everett branches, mathematical multiverse, counterlogicals, simulation, reality fluid, attention juice, etc. Example: before, a conscious being was a flesh-and-blood human; now it is a computational pattern. Example: before, you took for granted a background moral perspective; now, you see that everything that produces your sense of values and morals is some algorithms, put there by evolution and training. This can disconnect previously-functional flows from values through beliefs to actions. E.g. now you think it’s fine to suppress / disengage some moral intuition / worry you have, because it’s just some neurological tic. Or, now that you think of morality as “what successfully exists”, you think it’s fine to harm other people for your own advantage. Or, now that you’ve noticed that some things you thought were deep-seated, truthful beliefs were actually just status-seeking simulacra, you now treat everything as status-seeking simulacra. Or something, idk.
You set off a self-sustaining chain reaction of reevaluating, which degrades your ability to control your decision to continue expanding the scope of reevaluation, which degrades your value judgements and general sanity. See: https://www.lesswrong.com/posts/n299hFwqBxqwJfZyN/adele-lopez-s-shortform?commentId=RZkduRGJAdFgtgZD5 , https://www.lesswrong.com/posts/n299hFwqBxqwJfZyN/adele-lopez-s-shortform?commentId=zWyC9mDQ9FTxKEqnT
These can also spread to other people (even if it doesn’t happen to the philosopher who comes up with the instigating thoughts).
Thanks, I updated down a bit on risks from increasing philosophical competence based on this (as all of these seem very weak)
(Relevant to some stuff I’m doing as I’m writing about work in this area.)
IMO, the biggest risk isn’t on your list: increased salience and reasoning about infohazards in general and in particular certain aspects of acausal interactions. Of course, we need to reason about how to handle these risks eventually but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence so we quickly move through the regime where AIs aren’t smart enough to be careful, but are smart enough to discover info hazards.
I think the most dangerous version of 3 is a sort of Chesterton’s fence situation, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (Decline in high-g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn’t require holding on to the mistaken belief.
Do you have an example for 4? It seems rather abstract and contrived.
Generally, I think the value of believing true things tends to be almost always positive. Examples to the contrary seem mostly contrived (basilisk-like infohazards) or only occur relatively rarely. (E.g. believing a lie makes you more convincing, as you don’t technically have to lie when telling the falsehood, but lying is mostly bad or not very good anyway.)
Overall, I think the risks from philosophical progress aren’t overly serious while the opportunities are quite large, so the overall EV looks comfortably positive.
I think that makes sense, but sometimes you can’t necessarily motivate a useful norm “by recognizing that the norm is useful” to the same degree that you can with a false belief. For example there may be situations where someone has an opportunity to violate a social norm in an unobservable way, and they could be more motivated by the idea of potential punishment from God if they were to violate it, vs just following the norm for the greater (social) good.
Hard not to sound abstract and contrived here, but to say a bit more, maybe there is no such thing as philosophical progress (outside of some narrow domains), so by doing philosophical reflection you’re essentially just taking a random walk through idea space. Or philosophy is a memetic parasite that exploits bug(s) in human minds to spread itself, perhaps similar to (some) religions.
I think the EV is positive if done carefully, which I think I had previously been assuming, but I’m a bit worried now that most people I can attract to the field might not be as careful as I had assumed, so I’ve become less certain about this.
I would expect higher competence in philosophy to reduce overconfidence, not increase it? The more you learn, the more you realize how much you don’t know.
Today I was author-banned for the first time, without warning and as a total surprise to me, ~8 years after banning power was given to authors, but less than 3 months since @Said Achmiz was removed from LW. It seems to vindicate my fear that LW would slide towards a more censorious culture if the mods went through with their decision.
Has anyone noticed any positive effects, BTW? Has anyone who stayed away from LW because of Said rejoined?
Edit: In addition to the timing, previously, I do not recall seeing a ban based on just one interaction/thread, instead of some long term pattern of behavior. Also, I’m not linking the thread because IIUC the mods do not wish to see authors criticized for exercising their mod powers, and I also don’t want to criticize the specific author. I’m worried about the overall cultural trend caused by admin policies/preferences, not trying to apply pressure to the author who banned me.
[Reposting my previous comment without linking to the specific thread in question:]
I don’t understand the implied connection to “censorious” or “culture”. You had a prolonged comment thread/discussion/dispute (I didn’t read it) with one individual author, and they got annoyed at some point and essentially blocked you. Setting aside both the tone and the veracity of their justifying statements (<quotes removed>), disengaging from unpleasant interactions with other users is normal and pretty unobjectionable, right?
(Thanks for reposting without the link/quotes. I added back the karma your comment had, as best as I could.) Previously, the normal way to disengage was to just disengage, or to say that one is disengaging and then stop responding, not to suddenly ban someone without warning based on one thread. I do not recall seeing a ban previously that wasn’t based on some long term pattern of behavior.
This is complicated. I was also banned by someone; I even wrote a bug report (“hey, I can’t reply to this person, pls fix the bug”) and was told that I had actually been banned by that person… it kinda hurt for a moment, but then I shrugged and mostly forgot about it.
I think there are both possible good and bad consequences.
It will be bad if some people start systematically banning e.g. those who call out their bullshit, because that is a large part of the value of this website: people giving reasonable feedback. If an article is bullshit, someone will likely mention it in the comments. If a comment is bullshit, and it is visible enough, someone will likely reply. This mechanism could be broken if some people start banning those who provide this feedback.
On the other hand, some people really are super annoying. But should they be banned from the website? Seems too harsh. But if they become obsessed with some selected targets, those targets may decide to leave, which is also bad. Giving the target the possibility to ban the annoying person seems like a win/win solution.
...no conclusion here.
Maybe allow every user to only ban three people? (If you ban the fourth one, the first one is automatically unbanned.) This is based on assumption that really annoying people are rare and don’t have multiple accounts. So you can deflect the one that annoys you most, without being able to build an echo chamber.
Maybe figure out another way to signal “I consider your behavior annoying” without a ban?
Maybe reduce the meaning of the ban from “you can’t reply to this person at all” to “you can only post one reply per article”? So you can state your objections, but you can’t stay there and keep interacting with the author. When you are writing the reply, you are notified that this is the only one you get under this article.
Could be good, if combined with up to 3 total bans.
I think that’s fairly limited evidence, would want to see more data than that before claiming anything is vindicated.
Yeah, I would take a bet with you about eg if you’ll be banned by another author in the next 3 years. I think at least 60% on “no”.
I think the cultural slide will include self-censorship, e.g., having had this experience (of being banned out of the blue), in the future I’ll probably subconsciously be constantly thinking “am I annoying this author too much with my comments” and disengage early or change what I say before I get banned, and this will largely be out of my conscious control.
(I don’t want to start a fight and hopefully I’ll write a post explaining the behavior I’m talking about, but I’ll say abstractly, my hope in general is for people (me, you, anyone) to try as much as feasible to make fairly precise updates, like “this specific behavior pattern is bad / unhelpful / unwelcome in this context” rather than “I should be vaguely more worried about being vaguely annoying”.)
I think when a human gets a negative reward signal, probably all the circuits that contributed to the “episode trajectory” get downweighted, and antagonistic circuits get upweighted, similar to an AI being trained with RL. I can override my subconscious circuits with conscious willpower, but I only have so much conscious processing and willpower to go around. For example, I’m currently feeling a pretty large aversion towards talking with you, but am overriding it because I think it’s worth the effort to get this message out; I can’t keep the “override” active forever.
Of course I can consciously learn more precise things, if you were to write about them, but that seems unlikely to change the subconscious learning that happened already.
I wrote Smart Losers a long time ago, trying to understand/explain certain human phenomena. But the model could potentially be useful for understanding (certain aspects of) human-AI interactions as well.
Possibly relevant anecdote: Once I was with a group of people who tried various psychological experiments. That day, the organizers proposed that we play iterated Prisoner’s Dilemma. I was like “yay, I know the winning strategy, this will be so easy!”
I lost. Almost everyone always defected against me; there wasn’t much I could do to get points comparable to other people who mostly cooperated with each other.
After the game, I asked why. (During the game, we were not allowed to communicate, just to write our moves.) The typical answer was something like: “well, you are obviously very smart, so no matter what I do, you will certainly find a way to win against me, so my best option is to play it safe and always defect, to avoid the worst outcome”.
I am not even sure if I should be angry at them. I suppose that in real life, when you have about average intelligence, “don’t trust people visibly smarter than you” is probably a good strategy, on average, because there are just too many clever scammers walking around. At the same time I feel hurt, because I am a natural altruist and cooperator, so this feels extremely unfair, and a loss for both sides.
(There were other situations in my life where the same pattern probably also applied, but most of the time, you just don’t know why other people do whatever they do. This time I was told their reasoning explicitly.)
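(A minimal numeric illustration of why this position is so costly, using the standard Prisoner’s Dilemma payoffs T=5, R=3, P=1, S=0, which are assumptions rather than the values used in the game described above: when everyone defects against you, your best response is also to defect, which still leaves you far behind the mutual cooperators.)

```python
# Standard (assumed) payoffs to the row player; not taken from the anecdote above.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
ROUNDS = 20

# If everyone always defects against you, defecting back is your best option...
targeted_player = ROUNDS * PAYOFF[("D", "D")]     # 20 points
# ...while the others, mostly cooperating with each other, do much better.
mutual_cooperators = ROUNDS * PAYOFF[("C", "C")]  # 60 points

print(targeted_player, mutual_cooperators)
```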
What (human or not) phenomena do you think are well explained by this model? I tried to think of any for 5 minutes and the best I came up with was the strong egalitarianism among hunter gatherers. I don’t actually know that much about hunter gatherers though. In the modern world something where “high IQ” people are doing worse is sex, but it doesn’t seem to fit your model.
Human-human: Various historical and current episodes of smarter-than-average populations being persecuted or discriminated against, such as intellectuals, “capitalists” (i.e., people labeled as such), certain ethnic groups. (I’m unsure my model is actually a good explanation of such phenomena, but this is mainly what I was trying to explain.)
Human-AI: Many people being reluctant to believe that it’s a good idea to build unaligned artificial superintelligence and then constrain it with a system of laws and/or social norms (which some people, like Robin Hanson and Matthew Barnett, have proposed). Aside from the issue of violent overthrow, any such system is bound to have loopholes, which the ASI will be more adept at exploiting, yet this adeptness potentially makes the ASI worse off (less likely to exist in the first place), similar to what happens in my model.
I want to highlight a point I made in an EAF thread with Will MacAskill, which seems novel or at least underappreciated. For context, we’re discussing whether the risk vs time (in AI pause/slowdown) curve is concave or convex, or in other words, whether the marginal value of an AI pause increases or decreases with pause length. Here’s the whole comment for context, with the specific passage bolded:
I think my point in the opening comment does not logically depend on whether the risk vs time (in pause/slowdown) curve is convex or concave[1], but it may be a major difference in how we’re thinking about the situation, so thanks for surfacing this. In particular I see 3 large sources of convexity:
The disjunctive nature of risk / conjunctive nature of success. If there are N problems that all have to be solved correctly to get a near-optimal future, without losing most of the potential value of the universe, then that can make the overall risk curve convex or at least less concave. For example, compare f(x) = 1 − 1/2^(1 + x/10) and f^4 (see the sketch after this list).
Human intelligence enhancements coming online during the pause/slowdown, with each maturing cohort potentially giving a large speed boost for solving these problems.
Rationality/coordination threshold effect, where if humanity makes enough intellectual or other progress to subsequently make an optimal or near-optimal policy decision about AI (e.g., realize that we should pause AI development until overall AI risk is at some acceptable level, or something like this but perhaps more complex involving various tradeoffs), then that last bit of effort or time to get to this point has a huge amount of marginal value.
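Here is a minimal numeric sketch of point 1, assuming x is the pause length in years, f(x) = 1 − 1/2^(1 + x/10) is the chance of getting a single problem right, and four such problems all need to go right (so overall success is f(x)^4). The single-problem curve has diminishing returns per decade, while the conjunctive curve’s per-decade gains first grow before shrinking, i.e. it is less concave:

```python
# Marginal value of extra pause time for one problem (f) vs. four conjunctive
# problems (f**4), using the example curve from point 1 above.
def f(x):
    return 1 - 1 / 2 ** (1 + x / 10)

for lo, hi in [(0, 10), (10, 20), (20, 30), (30, 40)]:
    single = f(hi) - f(lo)
    all_four = f(hi) ** 4 - f(lo) ** 4
    print(f"years {lo:>2}-{hi:<2}: single +{single:.3f}, all four +{all_four:.3f}")
# Single-problem gains shrink every decade; the all-four gains rise at first, then shrink.
```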
I think this kind of approach can backfire badly (especially given human overconfidence), because we currently don’t know how to judge progress on these problems except by using human judgment, and it may be easier for AIs to game human judgment than to make real progress. (Researchers trying to use LLMs as RL judges apparently run into the analogous problem constantly.)
What if the leaders can’t or shouldn’t trust the AI results?
I’m trying to coordinate with, or avoid interfering with, people who are trying to implement an AI pause or create conditions conducive to a future pause. As mentioned in the grandparent comment, one way people like us could interfere with such efforts is by feeding into a human tendency to be overconfident about one’s own ideas/solutions/approaches.
What I’ve been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:
Writing articles in Chinese for my family members, explaining things like cognitive bias, evolutionary psychology, and why dialectical materialism is wrong. (My own Chinese writing ability is <4th grade.) My workflow is to have a chat about some topic with the AI in English, then have it write an article in Chinese based on the chat, then edit or have it edit as needed.
Simple coding/scripting projects. (I don’t code seriously anymore.)
Discussing history, motivations of actors, impact of ideology and culture, what if, etc.
Searching/collating information.
Reviewing my LW posts/comments (any clear flaws, any objections I should pre-empt, how others might respond)
Explaining parts of other people’s comments when the meaning or logic isn’t clear to me.
Expanding parts of my argument (and putting this in a collapsible section) when I suspect my own writing might be too terse or hard to understand.
Sometimes just having a sympathetic listener for my lamentations about humanity's probable fate.
I started using AI more after Grok 3 came out (I have an annual X subscription for Tweeting purposes), as previous free chatbots didn't seem capable enough for many of these purposes, and then switched to Gemini 2.0 Pro, which was force-upgraded to 2.5 Pro. Curious what other people are using AI for these days.
I successfully use the Claude web interface to:
generate simple Python code, mostly to work with files and images
ask for examples of how to do something in certain Java libraries
translate a book from Russian to Slovak and English, including puns and poems
I tried to also use Claude to explain to me some parts of set theory, but it hallucinates so much that it is unusable for this purpose. Practically every mathematical argument contains an error somewhere in the middle. Asking the same question in two chats will give me “yes—here is the proof” in one, and “no—here is a counterexample” in another; and that’s after I’ve already turned on the extra careful mathematical reasoning.
My wife tried to use Claude for biochemical research, but again, too many hallucinations to be useful. Anything you ask, “yes, this is correct, you are so smart, let me give you a few scientific references for that...” (all made up).
Your needing to write them seems to suggest that there’s not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I’m also curious about how your family received these articles.
I'm not sure how much such content exists in Chinese, because I didn't look. It seems easier to just write new content using AI; that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also, reading Chinese is kind of a chore for me, and I don't want to wade through a list of search results trying to find what I need.
I thought about publishing them somewhere, but so far haven't, for a few reasons:
concerns about publishing AI content (potentially contributing to “slop”)
not active in any Chinese forums, not familiar with any Chinese publishing platforms
probably won't find an audience (there's too much low-quality content on the web; how would people find my posts?)
don’t feel motivated to engage/dialogue with a random audience, if they comment or ask questions
I mostly use the Claude desktop client with MCPs (Model Context Protocol servers, which work like additional plugins and tooling for Claude to use) for the items below (a minimal server sketch follows the list):
A 2-iteration Delphi method, calling Gemini 2.5 Pro plus whatever is at the top of the LLM Arena that day, through OpenRouter
Metaculus, Kalshi, and Manifold search, for quick intuition on subjects
Smart fetch (for OCR'ing PDFs, images, etc.)
Local memory
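For anyone who hasn't run into MCPs before, here is a minimal sketch of what one of these servers can look like. It assumes the official Python mcp SDK's FastMCP helper, and the tool itself is a toy placeholder rather than any of the MCPs listed above:

```python
# Minimal MCP server sketch (assumes the official Python `mcp` SDK's FastMCP
# helper; the word_count tool is a toy placeholder, not one of the MCPs above).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    # Claude Desktop launches this over stdio once the server is registered
    # under "mcpServers" in its config file.
    mcp.run()
```

Once it's registered in the desktop client's config, Claude can discover the tool and call it mid-conversation.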
This sounds interesting. I would be interested in more details and some sample outputs.
What do you use this for, and how?
Sure, I'm currently on my phone, but I can paint a quick picture.
Local memory: I keep my own internal predictions on Fatebook and have them synced locally to my Obsidian vault (a local markdown file manager). Then I use Claude's Obsidian MCP to help me write down my daily notes from work, together with a jumbled context of my messages with coworkers, random web comments, and other messaging services, so it can help me keep my profiles on my friends and projects up to date. (This is, again, glued together with more MCPs that have limited access to my chat logs with friends.) Of course, with a human in the loop.
Delphi: I wrote a simple MCP that basically just runs the Delphi method with LLMs. Usually facilitated by Claude, it calls a panel of "experts": the top-K ranked models on LLM Arena. It generates a questionnaire based on my question, hands it out, aggregates the answers, and decides whether a consensus has been reached. Again, it gets the context it needs from me through my Obsidian vault. I use this for questions that are more personal, or that don't have good liquidity on prediction markets; a rough sketch of such a loop is below.
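For concreteness, here is a rough sketch of what such a Delphi loop could look like, assuming OpenRouter's OpenAI-compatible chat endpoint; the panel list and the missing consensus check are placeholders, not the actual MCP described above:

```python
# Sketch of a 2-iteration Delphi loop over a panel of LLMs (an illustration
# only; assumes OpenRouter's OpenAI-compatible API, and the PANEL list plus
# the lack of a real consensus check are placeholders).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Stand-ins for "the top-K ranked models on LLM Arena that day".
PANEL = [
    "google/gemini-2.5-pro-preview",
    "anthropic/claude-3.7-sonnet",
    "openai/gpt-4o",
]

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def delphi(question: str, rounds: int = 2) -> list[str]:
    """Each round, every panelist answers; from round 2 on, they also see the
    previous round's anonymized answers and may revise their own."""
    answers: list[str] = []
    for _ in range(rounds):
        prompt = question
        if answers:
            summary = "\n\n".join(
                f"Panelist {i + 1}: {a}" for i, a in enumerate(answers)
            )
            prompt += (
                "\n\nHere are the previous round's anonymized answers. "
                "Revise your answer if they change your mind:\n\n" + summary
            )
        answers = [ask(m, prompt) for m in PANEL]
    return answers  # a real version would also judge whether consensus was reached

print(delphi("Will I regret switching note-taking apps this year? Give a probability."))
```

The real MCP presumably also pulls in Obsidian context and lets Claude facilitate; this only shows the answer-and-revise loop over the panel.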
Does Gemini 2.5 Pro still work for this, given how sycophantic the post-0325 models were?
I'm still using it for this purpose, but don't have a good sense of how much worse it is compared to pre-0325. However, I'm definitely very wary of the sycophancy and overall bad judgment. I'm only using these models to point out potential issues I may have overlooked, not to judge, e.g., whether a draft is ready to post or whether some flagged issue is real and needs to be fixed. All the models I've tried seem to err a lot in both directions.
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
In the subsequent 15 years, I've upweighted influencing the multiverse through my copies in simulations, relative to base universes, to the point where they're about equally important in my mind, for 4 reasons:
Realizing that there are probably universes with vastly greater computational resources than ours, implying there are more simulations containing me than I had thought.
Realizing that there may well be many possible motivations for running simulations, besides interest in ancestors, which was the main motivation talked about back in 2010. This includes acausal trade, curiosity about aliens, science on the distribution of superintelligences in the multiverse, gathering philosophical ideas, entertainment, altruism.
Being fortunate in various ways, on a personal level, that seem more likely if I’m in a simulation.
Doing badly at a civilizational level, in a way that devalues trying to influence base reality. That is, relative to 2010, it looks more likely that we'll undergo the AI transition blindly, without key decision-makers paying attention to important philosophical and strategic ideas.
Concretely, this mostly cashes out to me thinking and writing with potential simulators in mind as an additional audience, hoping my ideas might benefit or interest some of them even if they end up largely ignored in this reality.
What made you believe that?
I find it hard to even conceptualize how to think through something like that, including the anthropics, which computationally powerful universes to admit, &c.
My intuition is that allowing universes with hypercomputation puts us in a dovetailer being run almost surely somewhere in the most computationally powerful universes, but that this all introduces a ton of difficulties into reasoning about the multiverse and our position inside of it.
Yeah, my intuition is similar to yours, and it seems very difficult to reason about all of this. That just represents my best guess.
I'm sharing a story about the crew of the Enterprise from Star Trek TNG getting an intelligence/rationality upgrade, which causes them to deduce their fictional nature.[1]
[1] This was written with AI assistance. My workflow was to give the general theme to the AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and make small manual edits. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.
Nice story! Could you copy-paste here the prompt you used? I’m curious what bits of plot were added by you and which were improvised by the AI.
Thanks! I was unsure whether to post this, because I’m worried about contributing to a future where AI generated content overwhelms everything, and I wasn’t sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)
Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I’ll just describe what I did in more detail. Started with
which resulted in this initial outline:
I. Teaser
The Enterprise responds to a distress call from a Federation outpost
Upon arrival, they find the outpost abandoned, with only a mysterious device left behind
II. Act One
Data activates the device, which emits a pulse of energy
Crew members begin noticing small inconsistencies in their daily routines
Geordi questions why they don’t use replicators to solve resource shortages on colonies
III. Act Two
More crew members report strange observations
Dr. Crusher wonders why they don’t use transporter technology to cure diseases or injuries
Worf ponders the improbability of so many humanoid species evolving independently
Picard becomes concerned about the crew’s unusual behavior
IV. Act Three
Riker raises the question of why they don’t use transporters for backing up crew members
The senior staff meets to discuss the mounting inconsistencies
Data theorizes that the mysterious device has enhanced their critical thinking abilities
Picard orders the ship to return to the outpost for further investigation
V. Act Four
At the outpost, they discover records of experiments in “cognitive enhancement”
The crew starts questioning the nature of their reality
Some crew members experience existential crises
The ship’s computer begins to malfunction as the crew’s doubts intensify
VI. Act Five
Data determines that their enhanced cognition is causing a feedback loop in the ship’s systems
The crew must find a way to deactivate the device and return to normal
Philosophical debate ensues about whether they should retain their newfound awareness
Picard makes the difficult decision to deactivate the device
VII. Epilogue
The crew returns to normal, but with a lingering sense of unease
Picard records a captain’s log reflecting on the experience and the nature of their reality
Final scene shows a subtle glitch in the ship’s operations, leaving a sense of ambiguity
Then the main things I recall doing:
Ask AI to add parts about audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
Add second part of conversation between Guinan and Ensign, with both of them having second thoughts about what Guinan initially said.
Add part about suspending ship computer to sandbox, to make clear it wasn’t harmed.
Ask AI for additional realizations/inferences the crew might have, and add selected ones to the story.
Ask AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
Point out a bunch of sentences/paragraphs that didn't make as much sense as they could, with the AI generally able to improve them a lot in one try.
Also tried to get AI to find issues with the story on its own, but that didn’t work very well.
Manually make some edits, like changing a word here and there, deleting parts that were repetitive, and moving parts around to make the story flow better.
So, it wasn’t your idea for the crew to start theorizing that they were fictional characters? Nor was it your idea for them to theorize that they were fictional characters in a story written by AI?
Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:
And as part of its answer it suggested:
Then I wrote
Which made it focus more on the fiction hypothesis.
The part about the story being written by AI was its own idea. No hints from me as far as I can remember.
ETA: Here’s the whole answer to my “make it more philosophically and intellectually interesting” prompt:
This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:
Expand on the nature of reality:
Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Explore the implications of living in a simulated reality and how it affects free will and personal identity.
Deepen the existential crisis:
Have characters grapple with the meaning of their existence if their reality is artificial.
Explore how this realization affects their relationships, duties, and sense of purpose.
Include scenes where characters contemplate whether their memories and experiences are “real.”
Introduce a moral dilemma:
If they discover their reality is simulated, should they attempt to contact their “creators”?
Explore the ethics of continuing to exist in a simulated world vs. attempting to “wake up” to a potentially harsher reality.
Have characters debate the morality of creating simulated universes themselves.
Expand on the scientific implications:
Have Data and Geordi theorize about the computational power required to simulate their universe.
Explore how the laws of physics in their universe might be “programmed” and if they can be altered.
Discuss the anthropic principle and fine-tuning of universal constants.
Incorporate elements of epistemology:
Have characters question how they can know anything for certain given their potentially simulated existence.
Explore concepts like Descartes’ “I think, therefore I am” in the context of artificial consciousness.
Discuss the nature of consciousness and whether it can be truly simulated.
Add a temporal element:
Introduce time dilation or time loops as the crew’s enhanced cognition allows them to perceive time differently.
Explore how their perception of time affects their understanding of causality and free will.
Enhance the ending:
Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
Consider a more ambiguous ending that leaves the question of their reality’s nature unresolved.
Include references to philosophical works:
Have characters discuss relevant philosophical concepts, such as Plato’s Allegory of the Cave, Berkeley’s idealism, or Putnam’s brain in a vat thought experiment.
Include a scene where the crew debates different philosophical perspectives on the nature of reality.
Explore the ethical implications of their technology:
Expand on Riker’s question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
Delve deeper into the ethics of using replicators and how it affects societal structures and values.
Incorporate a meta-narrative element:
Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
Explore the idea of narrative causality and how it might shape their reality.
These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.