AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of/familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ to the book text and then asking the model if there is anything in the book that seems out of place. And the model failed to locate it.
The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?
RLHF with humans might also soon be obsoleted by things like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won’t necessarily be important: tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (“constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot-checking what happens as a result of choosing particular instructions.
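A rough sketch of the loop being described, under stated assumptions (the names `policy`, `judge`, and `dpo_update` are hypothetical placeholders, not a real library API):

```python
# Sketch of online DPO with AI-generated preference labels, per the description above.
# All names below are illustrative placeholders, not an actual pipeline.

def online_dpo_round(policy, judge, prompts, constitution, dpo_update):
    preference_pairs = []
    for prompt in prompts:
        # On-policy: sample candidate responses from the model currently being tuned.
        response_a = policy.sample(prompt)
        response_b = policy.sample(prompt)
        # Another chatbot (or the tuned model itself, as in self-rewarding setups)
        # labels the pair, guided only by written labeling instructions ("constitution").
        chosen, rejected = judge.pick_preferred(
            prompt, response_a, response_b, instructions=constitution)
        preference_pairs.append((prompt, chosen, rejected))
    # DPO optimizes directly on preference pairs; no separate RL-style reward model is kept.
    dpo_update(policy, preference_pairs)
    return policy
```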
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned) such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time of an experienced software engineer, which might suffice for good long context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
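A quick back-of-envelope check of the arithmetic above (the engineer’s hourly rate is an assumed figure, not from the comment):

```python
# Back-of-envelope check of the labeling-budget arithmetic above; all inputs are illustrative.
budget_usd = 2_000_000        # hypothetical long-context RLHF labeling budget
num_samples = 1_000           # assumed number of samples if RLHF is very sample efficient
engineer_usd_per_hour = 200   # assumed loaded cost of an experienced software engineer

usd_per_sample = budget_usd / num_samples
hours_per_sample = usd_per_sample / engineer_usd_per_hour
print(f"${usd_per_sample:,.0f} per sample, ~{hours_per_sample:.0f} engineer-hours per label")
# -> $2,000 per sample, ~10 engineer-hours per label
```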
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
From a previous comment: [...] This seems to be evidence that RLHF does not tend to generalize well out-of-distribution
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days, which isn’t too bad of a delay. (If we’re fine with these labels trickling in for online training, the delay could be even smaller.)
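A minimal sketch of the two-head idea, assuming a pooled LM encoder as the backbone; the module names and the two-phase schedule are illustrative, not anyone’s actual setup:

```python
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    """Sketch: one head fit on ~100,000s of cheap weak labels, a second head fit on
    ~1,000 high-quality labels to predict the (strong - weak) residual."""

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                         # assumed LM encoder returning a pooled vector
        self.weak_head = nn.Linear(hidden_dim, 1)        # phase 1: trained on weak labels
        self.correction_head = nn.Linear(hidden_dim, 1)  # phase 2: trained on (strong - weak) differences

    def forward(self, inputs):
        h = self.backbone(inputs)                        # representation of (prompt, response)
        weak_reward = self.weak_head(h)
        return weak_reward + self.correction_head(h)

# Sketch of the two training phases:
#   1. fit weak_head on the weak labels, with correction_head held at zero;
#   2. freeze weak_head, fit correction_head on the small high-quality set,
#      targeting the difference between the strong label and weak_head's prediction.
```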
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”?
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
I’m thinking that the most ethical (morally least risky) way to “insure” against a scenario in which AI takes off and property/wealth still matters is to buy long-dated far out of the money S&P 500 calls. (The longest dated and farthest out of the money seems to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
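For concreteness, the payoff arithmetic of that example, using the numbers as stated and ignoring fees and contract-multiplier details:

```python
# Payoff arithmetic for the example above, using the comment's numbers as given
# (ignoring fees, taxes, and contract-multiplier details).
premium = 78            # stated cost today of one Dec 2029 10000-strike SPX call
strike = 10_000
spx_at_expiry = 20_000  # hypothetical "AI takes off" scenario

payoff = max(spx_at_expiry - strike, 0)
print(f"payoff: {payoff}, return multiple: {payoff / premium:.0f}x, "
      f"loss if SPX stays below {strike}: {premium}")
# -> payoff: 10000, return multiple: 128x, loss if SPX stays below 10000: 78
```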
My reasoning here is that I don’t want to provide capital to AI industries or suppliers, because that seems wrong given the high x-risk I judge their activities to be causing (otherwise I’d directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my/moral values. Suggestions welcome for better/alternative ways to do this.
This probably does help capitalize AI companies a little bit, since demand for call options will create demand for the underlying. This is probably a relatively small effect (?), but I’m not confident in my ability to estimate this at all.
Default seems unlikely, unless the market moves very quickly, since anyone pursuing this strategy is likely to be very small compared to the market for the S&P 500.
(Also consider that these pay out in a scenario where the world gets much richer — in contrast to e.g. Michael Burry’s “Big Short” swaps, which paid out in a scenario where the market was way down — so you’re just skimming a little off the huge profits that others are making, rather than trying to get them to pay you at the same time they’re realizing other losses.)
It seems hard to buy AI companies at the moment. The only way is to buy tech giants like Microsoft, Google, and Nvidia, which are already valued very highly—it seems like it’s somewhat priced in. It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
It would seem one would want to buy Nasdaq rather than SPX? On the other hand, maybe most tech companies will be wiped out by AI—it’s the world of atoms that would gain relative value.
It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
I’m also uncertain, and it’s another reason for going long a broad index instead. I would go even broader than the S&P 500 if I could, but nothing else has option chains going out to 2029.
I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia (including the most influential ones so far: UDT, UDASSA, b-money). Not really complaining, as they’re already more popular than I had expected (Holden Karnofsky talked extensively about UDASSA on an 80,000 Hours podcast, which surprised me), it just seems strange that the popularity stops right at academia’s door. (I think almost no philosophy professor, including ones connected with rationalists/EA, has talked positively about any of my philosophical ideas? And b-money languished for a decade gathering just a single citation in academic literature, until Satoshi reinvented the idea, but outside academia!)
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it. Hilary Greaves says the following in this 80k episode:
Then as many of your listeners will know, in the space of AI research, people have been throwing around terms like ‘functional decision theory’ and ‘timeless decision theory’ and ‘updateless decision theory’. I think it’s a lot less clear exactly what these putative alternatives are supposed to be. The literature on those kinds of decision theories hasn’t been written up with the level of precision and rigor that characterizes the discussion of causal and evidential decision theory. So it’s a little bit unclear, at least to my likes, whether there’s genuinely a competitor to decision theory on the table there, or just some intriguing ideas that might one day in the future lead to a rigorous alternative.
I also think it is unclear to what extent UDT and updateless are different from existing ideas in academia that are prima facie similar, like McClennen’s (1990) resolute choice and Meacham’s (2010, §4.2) cohesive decision theory.[1] Resolute choice in particular has been discussed in a lot of detail, and for a long time (see the citations of McClennen’s book). (And, FWIW, my sense is that most philosophers think that resolute choice is irrational and/or doesn’t make sense, at least if it is cashed out as a decision rule based on preference.)
It also doesn’t help that it is unclear what the difference between FDT and UDT is supposed to be.
(If UDT is supposed to be an LDT of some sort, then you might want to check out Spohn’s (2012)[2] version of CDT, Fisher’s (n.d) disposition-based decision theory, and Poellinger’s (2013) discussion of Spohn’s theory, for ideas in academia that are similar to the LDT-part of the theory. And then there is also Schwarz’ critique of FDT, which would then also apply to UDT, at least partially.)
My own take, using the terminology listed here, is that the causalist version of Meacham’s cohesive decision theory is basically “updateless CDT”, that the evidentialist version is basically “updateless EDT”, and that a Spohn-CDT version of cohesive decision theory is basically “U(C)DT/F(C)DT”. I also think that resolute choice is much more permissive than e.g. cohesive decision theory and updatelessness. As a decision rule, it doesn’t recommend anything close to “maximizing EU relative to your prior”. Instead, it just states that (i) what you prefer ex ante in a dynamic choice problem should be the same as what you prefer in the normalised version of the problem, and (ii) you should be dynamically consistent (i.e., the most preferred plan should not change throughout the decision problem).
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues, despite them having existed since the early 90s[1] and appearing on the introductory SEP page for dynamic choice.
Academics may in turn ask: clearly LessWrong has some blind spots, but how big?
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues
I don’t think this is fair, because even though component ideas behind UDT/FDT have academic analogues, it doesn’t look like someone put them together into a single decision theory formulation in academic literature, at least prior to MIRI’s “Cheating Death in Damascus” being published. Also “Cheating Death in Damascus” does cite both Meacham and Spohn (and others) and it seems excusable for me to have forgotten those references since they were both published after I wrote about UDT and again were only component ideas of it, plus I haven’t actively worked on decision theory for several years.
I think Sami’s comment is entirely fair given the language and framing of the original post. It is of course fine to forget about references, but e.g. “I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia” and “Clearly academia has some blind spots, but how big?” reads like you don’t consider it a possibility that you might have re-invented something yourself, and that academics are at fault for not taking up your ideas.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented in academia, nor gained much of a following there, at least as of 2019 when Cheating Death in Damascus was published, because the authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
I think there is nothing surprising about a small community of nerds writing in their spare time having blind spots, but when a large professional community has such blind spots, that’s surprising.
On your first point: as Sami writes, resolute choice is mentioned in the introductory SEP article on dynamic choice (it even has its own section!), as well as in the SEP article on decision theory. And SEP is the first place you go when you want to learn about philosophical topics and find references.
On your second point: as I wrote in my comment above, (i) academics have produced seemingly similar ideas to e.g. updatelessness (well before they were written up on LW) so it is unclear why academics should engage with less rigorous, unpublished proposals that appear to be similar (in other words, I don’t think the phrase “blind spots” is warranted), and (ii) when academics have commented on or engaged with LW DT ideas, they have to my knowledge largely been critical (e.g. see the post by Wolfgang Schwarz I linked above, as well as the quote from Greaves)[1].
To clarify, by “blind spot” I wasn’t complaining that academia isn’t engaging specifically with posts written up on LW, but more that nobody in academia seems to think that the combination of “updateless+logical” is clearly the most important or promising direction to explore in decision theory.
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it.
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
And the reason for the latter is the above, plus my personal distaste for writing/publishing academic papers (which I talked about elsewhere in this thread), plus FDT having been published which seems close enough to me.
Thanks for the references in the rest of your comment. I think I’ve come across Meacham 2010 and Spohn 2012 before, but forgot about them as I haven’t been working actively on decision theory for a while. It does seem that Meacham’s cohesive decision theory is equivalent to updateless EDT/CDT. (BTW, in The Absent-Minded Driver I referenced a 1997 paper that also has an idea similar to updatelessness, although the authors didn’t like it.)
On a quick skim of Spohn 2012 I didn’t see something that looks like LDT or “algorithmic/logical agent ontology” but it’s quite long/dense so I’ll take your word on it for now. Still, it seems like none of the academic papers put all of the pieces together in a single decision theory proposal that’s equivalent to UDT or FDT?
(Please note that UDT as originally described was actually updateless/evidential/logical, not causalist as you wrote in the post that you linked. This has been a historical disagreement between me and Eliezer, wherein I lean towards evidential and he leans towards causal, although these days I just say that I’m confused and don’t know what to think.)
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
That’s fair. But what is it then that you expect academics to engage with? How would you describe this research direction, and why do you think it’s interesting and/or important?
To quickly recap the history, people on LW noticed some clear issues with “updating” and “physicalist ontology” of the most popular decision theories at the time (CDT/EDT), and thought that switching to “updatelessness” and “logical/algorithmic ontology” was an obvious improvement. (I was the first person to put the two pieces together in an explicit formulation, but they were already being talked about / hinted at in the community.) Initially people were really excited because the resulting decision theories (UDT/FDT) seemed to solve a lot of open problems in one swoop, but then pretty quickly and over time we noticed more and more problems with UDT/FDT that seem to have no clear fixes.
So we were initially excited but then increasingly puzzled/confused, and I guess I was expecting at least some academics to follow a similar path, either through engagement with LW ideas (why should they be bothered that much by lack of academic publication?), or from independent invention. Instead academia seems to still be in a state similar to LW when I posted UDT, i.e., the ideas are floating in the air separately and nobody has put them together yet? (Or I guess that was the state of academia before FDT was published in an academic journal, so now the situation is more like some outsiders put the pieces together in a formal publication, but still no academic is following a similar path as us.)
I guess it’s also possible that academia sort of foresaw or knew all the problems that we’d eventually find with UDT/FDT and that’s why they didn’t get excited in the first place. I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are? Or what are they mostly working on / arguing about these days?
There are many many interesting questions in decision theory, and “dimensions” along which decision theories can vary, not just the three usually discussed on LessWrong. It’s not clear to me why (i) philosophers should focus on the dimensions you primarily seem to be interested in, and (ii) what is so special about the particular combination you mention (is there some interesting interaction I don’t know about maybe?). Furthermore, note that most philosophers probably do not share your intuitions: I’m pretty sure most of them would e.g. pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.) I don’t mean to be snarky, but you could just be wrong about what the open problems are.
I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are?
I wouldn’t say so, no. But I’m not entirely sure if I understand what the open problems are. Reading your list of seven issues, I either (i) don’t understand what you are asking, (ii) disagree with the framing/think the question is misguided, or (iii) think there is an obvious answer (which makes me think that I’m missing something). With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
There are many many interesting questions in decision theory, and “dimensions” along which decision theories can vary, not just the three usually discussed on LessWrong.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
what is so special about the particular combination you mention
Furthermore, note that most philosophers probably do not share your intuitions
Agreed, but my intuitions don’t seem so unpopular outside academia or so obviously wrong that there should be so few academic philosophers who do share them.
I’m pretty sure most of them would e.g. pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.)
I’m not sure I wouldn’t pay either. I see it as more of an interesting puzzle than having a definitive answer. ETA: Although I’m more certain that we should build AIs that do pay. Is that also unclear to you? (If so why might we not want to build such AIs?)
I don’t mean to be snarky, but you could just be wrong about what the open problems are.
Yeah, I’m trying to keep an open mind about that. :)
With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
Cool, I’d be interested in any further feedback when you’re ready to give them.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
I’m not sure I wouldn’t pay either. I see it as more of an interesting puzzle than having a definitive answer. ETA: Although I’m more certain that we should build AIs that do pay. Is that also unclear to you? (If so why might we not want to build such AIs?)
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
On the point about AI (not directly responding to your question, to which I don’t have an answer): I think it’s really important to be clear about whether we are discussing normative, constructive or descriptive decision theory (using Elliott Thornley’s distinction here). For example, the answers to “is updatelessness normatively compelling?”, “should we build an updateless AI?” and “will some agents (e.g. advanced AIs) commit to being updateless?” will most likely come apart (it seems to me). And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
Yes, missed or forgot about that sentence, sorry.
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
Thanks.
Do you have more examples where making such distinctions would be helpful?
I was mostly thinking about discussions surrounding what the “correct” decision theory is, whether you should pay in CM, and so on.
Thanks Sylvester! Yep it looks like cohesive decision theory is basically original UDT. Do you know what the state of the art is in terms of philosophical critiques of cohesive decision theory? Any good ones? Any good responses to the critiques?
Cohesive decision theory lacks the logical/algorithmic ontology of UDT and is closer to what we call “updateless EDT/CDT” (the paper itself talks about cohesive versions of both).
Also interested in a response from Sylvester, but I would guess that one of the main critiques is something like Will MacAskill’s Bomb thought experiment, or just intuitions for paying the counterfactual mugger. From my perspective, these do have a point when it comes to humans, since humans seemingly have indexical values, and one way to explain why UDT makes recommendations in these thought experiments that look “bizarre” to many humans is that it assumes away indexical values (via the type signature of its utility function). (It was an implicit and not totally intentional assumption, but it’s unclear how to remove the assumption while retaining nice properties associated with updatelessness.) I’m unsure if indexical values themselves are normative or philosophically justified, and they are probably irrelevant or undesirable when it comes to AIs, but I guess academic philosophers probably take them more for granted and are not as interested in AI (and therefore take a dimmer view on updatelessness/cohesiveness).
But yeah, if there are good critiques/responses aside from these, it would be interesting to learn them.
I don’t think cohesive decision theory is being discussed much, but I’m not sure. Perhaps because the theory is mainly used to argue against the claim that “every decision rule will lead agents who can’t bind themselves to disaster” (p. 20, footnote 34) in the paper, and discussion of its independent interest is relegated to a footnote (footnote 34).
Something about the aesthetics of what makes a “good” peer-reviewed physics paper, based on my personal experience, probably including my anecdote about solar cell R&D from here
I followed this link thinking that it looks relevant to my question, but the way Holden delineates what academia is interested in, it should totally be interested in my ideas:
I, today, when I think about what academia does, I think it is really set up to push the frontier of knowledge, the vast majority, and I think especially in the harder sciences. I would say the vast majority of what is going on in academia is people are trying to do something novel, interesting, clever, creative, different, new, provocative, that really pushes the boundaries of knowledge forward in a new way.
versus what Holden says are important questions that academia neglects:
There’s an intellectual topic, it’s really important to the world but it’s not advancing the frontier of knowledge. It’s more figuring out something in a pragmatic way that is going to inform what decision makers should do, and also there’s no one decision maker asking for it as would be the case with Government or corporations.
The rest of your comment seems to be hinting that maybe academia is ignoring my ideas because it doesn’t like the aesthetics of my writing? (Not sure if that was your point, or if those bullet points weren’t supposed to be directly related to my question...) Even if that’s true though, I’m still puzzled why academia hasn’t reinvented any of my ideas (which have been independently invented multiple times outside of academia, e.g. Nick Szabo and Satoshi with b-money, Paul Christiano with UDASSA).
Hmm, yeah I guess what I wrote wasn’t too directly helpful for your question.
the way Holden delineates what academia is interested in, it should totally be interested in my ideas…
I think Holden forgot “trendy”. Trendy is very important. I think people in academia have a tacit shared understanding of the currently-trending topics / questions, within which there’s a contest to find interesting new ideas / progress. If an idea is important but not trendy, it’s liable to get neglected, I think. It’s kinda like in clothing fashion: if you find a brilliant use of beads, but beads aren’t fashion-forward this year, roughly nobody will care.
Of course, the trends change, and indeed everyone is trying to be the pioneer of the next hot topic. There are a lot of factors that go into “what is the next hot topic”, including catching the interest of a critical mass of respected people (or people-who-control-funding), which in turn involves them feeling it’s “exciting”, and that they themselves have an angle for making further progress in this area, etc. But trendiness doesn’t systematically track objective importance, and it’s nobody’s job to make it so.
At least, that’s what things felt like to me in the areas of physics I worked in (optics, materials science, and related). I’m much less familiar with philosophy, economics, etc.
Remember, aside from commercially-relevant ideas, success for academic research scientists (and philosophers) is 100% determined by “am I impressing my peers?”—grants, promotions, invited talks, etc. are all determined by that. So if I write a paper and the prestigious people in my field are unanimously saying “I don’t know about that thing, it’s not an area that I know or care about”, the result is just as bad for me and my career as if those people had unanimously said “this is lousy work”.
it doesn’t like the aesthetics of my writing
To be clear, when I said “the aesthetic of what makes a good X”, I meant it in a really broad sense. Maybe I should have said “the implicit criteria of what makes a good X” instead. So “the paper concerns a currently-trendy topic” can be part of that, even though it’s not really “aesthetics” in the sense of beauty. E.g., “the aesthetic of what makes a good peer-reviewed experimental condensed-matter physics paper” has sometimes been greatly helped by “it somehow involves nanotechnology”.
From the years in academia studying neuroscience and related aspects of bioengineering and medicine development… yeah. So much about how effort gets allocated is not ‘what would be good for our country’s population in expectation, or good for all humanity’. It’s mostly about ‘what would make an impressive sounding research paper that could get into an esteemed journal?’, ‘what would be relatively cheap and easy to do, but sound disproportionately cool?’, ‘what do we guess that the granting agency we are applying to will like the sound of?’. So much emphasis on catching waves of trendiness, and so little on estimating expected value of the results.
Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.
Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.
Aside from the literature on international relations, I don’t know much about academic dysfunction (what I do know comes mostly from reading parts of Inadequate Equilibria, particularly the visitor dialog), and other LessWrong people can probably cover it better. I think that planecrash, Yud’s second HPMOR-scale work, mentions that everyone in academia just generally avoids citing things published outside of academia, because they risk losing status if they do.
It turns out that Earth economists are locked into powerful incentive structures of status and shame, which prevent them from discussing the economic work of anybody who doesn’t get their paper into a journal. The journals are locked into very powerful incentive structures that prevent them from accepting papers unless they’re written in a very weird Earth way that Thellim can’t manage to imitate, and also, Thellim hasn’t gotten tenure at a prestigious university which means they’ll probably reject the paper anyways. Thellim asks if she can just rent temporary tenure and buy somebody else’s work to write the paper, and gets approximately the same reaction as if she asked for roasted children recipes.
The system expects knowledge to be contributed to it only by people who have undergone painful trials to prove themselves worthy. If you haven’t proven yourself worthy in that way, the system doesn’t want your knowledge even for free, because, if the system acknowledged your contribution, it cannot manage not to give you status, even if you offer to sign a form relinquishing it, and it would be bad and unfair for anyone to get that status without undergoing the pains and trials that others had to pay to get it.
She went and talked about logical decision theory online before she’d realized the full depth of this problem, and now nobody else can benefit from writing it up, because it would be her idea and she would get the status for it and she’s not allowed to have that status. Furthermore, nobody else would put in the huge effort to push forward the idea if she’ll capture their pay in status. It does have to be a huge effort; the system is set up to provide resistance to ideas, and disincentivize people who quietly agreed with those ideas from advocating them, until that resistance is overcome. This ensures that pushing any major idea takes a huge effort that the idea-owner has to put in themselves, so that nobody will be rewarded with status unless they have dedicated several years to pushing an idea through a required initial ordeal before anyone with existing status is allowed to help, thereby proving themselves admirable enough and dedicated enough to have as much status as would come from contributing a major idea.
To suggest that the system should work in any different way is an obvious plot to steal status that is only deserved by virtuous people who work hard, play by the proper rules, and don’t try to cheat by doing anything with less effort than it’s supposed to take.
It’s glowfic, so of course I don’t know how accurate it is, as it’s intended to be plausibly deniable enough to facilitate free writing (while keeping things entertaining enough to register as not-being-work).
I have to think more about the status dynamics that Eliezer talked about. There’s probably something to it… But this part stands out as wrong or at least needing nuance/explanation:
Thellim hasn’t gotten tenure at a prestigious university which means they’ll probably reject the paper anyways
I think most academic venues do blind reviews and whoever decides whether or not to accept a paper isn’t supposed to know who wrote it? Which isn’t to say that the info won’t leak out anyway and influence the decision. (For example I once left out the acknowledgements section in a paper submission, thinking that, like the author byline, I was supposed to add it after the paper was accepted, but apparently I was actually supposed to include it and someone got really peeved that I didn’t.)
MIRI suggested I point out that Cheating Death In Damascus had recently been accepted in The Journal of Philosophy, a top philosophy journal, as evidence of (hopefully!) mainstream philosophical engagement.
From talking with people who do work on a lot of grant committees in the NIH and similar funding orgs, it’s really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea… So if you are familiar with the general approach of a particular lab and its primary investigator, you will immediately recognize and have a knee-jerk reaction (positive or negative) to a paper which pattern-matches to the work that that lab / subfield is doing.
Common reactions from grant reviewers:
Positive—“This fits in nicely with my friend Bob’s work. I respect his work, I should argue for funding this grant.”
Neutral—“This seems entirely novel to me, I don’t recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn’t argue too hard for it.”
Slightly negative—“This seems novel to me, and doesn’t sound particularly ‘jargon-y’ or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this.”
Negative—“This seems to pattern match to a subfield I feel biased against. Even if this isn’t from one of Jill’s students, it fits with Jill’s take on this subtopic. I don’t want views like Jill’s gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal.”
I will self-downvote so this isn’t the top comment. Yud’s stuff is neat, but I haven’t read much on the topic, and passing some along when it comes up has been a good general heuristic.
I think that UDASSA and UDT might be in academia’s blind spots in the same way that the Everett interpretation is: more correct theories that came after less correct theories with mostly only theoretical evidence to support changing over to the new theories.
Many parts of academia have a strong Not Invented Here tendency. Not only is research outside of academia usually ignored, but so is research outside a specific academic citation bubble, even if another bubble investigates a pretty similar issue. For example, economic decision theorists ignore philosophical decision theorists, who in turn mostly ignore the economic decision theorists. They each have their own writing style and concerns and canonical examples or texts. Which makes it hard for outsiders to read the literature or even contribute to it, so they don’t.
A striking example is statistics, where various fields talk about the same mathematical thing with their own idiosyncratic names, unaware or unconcerned whether it already had a different name elsewhere.
Edit: Though LessWrong is also a citation bubble to some degree.
Something a better, future version of rationalism could do is build bridges and facilitate communication between these little bubbles. The answer-to-everything approach has been tried too many times.
Indeed, there is no need for sorrow, for by choosing to remain anonymous, you have done great things. The world owes you a Nobel Prize in Economics and a Turing Award. It is time for the world to seriously recognize your achievements and lead it towards a financial system without bubbles.
Why haven’t you written academic articles on these topics?
The secret is that an academic article is just a format, and anyone can submit to scientific journals. There’s no need to have a PhD or even to work at a scientific institution.
I wrote an academic-style paper once, as part of my job as an intern in a corporate research department. It soured me on the whole endeavor, as I really didn’t enjoy the process (writing in the academic style, the submission process, someone insisting that I retract the submission to give them more credit despite my promise to insert the credit before publication), and then it was rejected with two anonymous comments indicating that both reviewers seemed to have totally failed to understand the paper and giving me no chance to try to communicate with them to understand what caused the difficulty. The cherry on top was my mentor/boss indicating that this is totally normal, and I was supposed to just ignore the comments and keep resubmitting the paper to other venues until I run out of venues.
My internship ended around that point and I decided to just post my ideas to mailing lists / discussion forums / my home page in the future.
Also, I think MIRI got FDT published in some academic philosophy journal, and AFAIK nothing came of it?
There is some similarity between UDASSA and “Law without law” by Mueller, as both use Kolmogorov complexity to predict the distribution of observers. In LwL there is not any underlying reality except numbers, so it is just dust theory over random number fields.
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
From someone who left a corner of it: the blindspots could be arbitrarily large as far as I know, because there seemed to me to be no real explicit culture of Hamming questions/metalooking for anything neglected. You worked on something vaguely similar/related to your advisor’s work, because otherwise you can’t get connections to people who know how to attack the problem.
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT doesn’t know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)
Isn’t the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the “constitution”. In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
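For concreteness, a minimal sketch of the two recipes described above (the prompt wording and the `model.generate` call are made-up placeholders, not the paper’s actual implementation):

```python
# Hedged sketch of the two Constitutional-AI-style data recipes mentioned above.
# `model.generate` stands in for any chat-model call; prompts are illustrative only.

def make_sft_example(model, user_prompt, principle):
    """Instruction-tuning recipe: generate, critique against a constitutional
    principle, then revise; keep (user_prompt, revision) as supervised data."""
    draft = model.generate(user_prompt)
    critique = model.generate(
        f"Critique this response according to the principle: {principle}\n\n{draft}")
    revision = model.generate(
        f"Rewrite the response to address the critique.\n\nCritique: {critique}\n\nResponse: {draft}")
    return user_prompt, revision

def make_preference_example(model, user_prompt, principle):
    """RLAIF recipe: the model itself labels which of two responses better follows
    the principle; the resulting pairs can feed a reward model or DPO."""
    a, b = model.generate(user_prompt), model.generate(user_prompt)
    verdict = model.generate(
        f"Which response better follows the principle '{principle}'? Answer A or B.\n\nA: {a}\n\nB: {b}")
    chosen, rejected = (a, b) if verdict.strip().startswith("A") else (b, a)
    return user_prompt, chosen, rejected
```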
The Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And the data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it’s unclear if future models won’t soon do better than humans at labeling, or possibly already do better at some leading labs. Meta didn’t have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models, and o1-like reasoning models.
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write up their responses—a complete violation of Scale’s raison d’être.
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here’s a poem(?) I wrote for Twitter, actually before coming across the above story:
The people try to align the board. The board tries to align the CEO. The CEO tries to align the managers. The managers try to align the employees. The employees try to align the contractors. The contractors sneak the work off to the AI. The AI tries to align the AI.
Maybe others are using it in secret but don’t want to admit it for some reason? I can’t find any mention of Anthropic having filed a patent on the idea, but maybe other companies would feel too much like it looked like they were second-rate imitators if they said they were copying Anthropic’s idea?
Just speculating, I don’t know. Sure seems like a useful idea to copy.
I’m increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I’ve previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
Low expressed interest in metaphilosophy (in relation to either AI or humans)
Low expressed interest in AI philosophical competence (either concern that it might be low, or desire/excitement for supercompetent AI philosophers with Jupiter-sized brains)
Low concern that philosophical difficulty will be a blocker of AI alignment or cause of AI risk
High confidence when proposing novel solutions (even to controversial age-old questions, and when the proposed solution fails to convince many)
Rarely attacking one’s own ideas (in a serious or sustained way) or changing one’s mind based on others’ arguments
Rarely arguing for uncertainty/confusion (i.e., that that’s the appropriate epistemic status on a topic), with normative ethics being a sometime exception
Possible explanations:
General human overconfidence
People who have a high estimate of the difficulty of philosophy self-selecting out of the profession.
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
“Signal group membership” may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
Here are some axes along which I think there’s some group membership signaling in philosophy (IDK about the extent and it’s hard to disentangle it from other stuff):
Math: platonism/intuitionism/computationalism (i.e. what is math?), interpretations of probability, foundations of math (set theory vs univalent foundations)
Mind: externalism/internalism (about whatever), consciousness (de-facto-dualisms (e.g. Chalmers) vs reductive realism vs illusionism), language of thought vs 4E cognition, determinism vs compatibilism vs voluntarism
Metaphysics/ontology: are chairs, minds, and galaxies real? (this is somewhat value-laden for many people)
Biology: gene’s-eye-view/modern synthesis vs extended evolutionary synthesis
I don’t think this is accurate, I think most philosophy is done under motivated reasoning but is not straightforwardly about signaling group membership
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident; but that is more stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare https://link.springer.com/article/10.1007/s43681-023-00379-1.
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they become convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning or better intellectual tools in general? (Or what’s the closest to this that you’re aware of?)
Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know 1 or 2 private expressions of the former, but not translated into action yet.)
Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
How often have you seen philosophers say something like “Upon further reflection, my proposed solution to problem X has many problems/issues, I’m no longer confident it’s the right approach and now think X is much harder than I originally thought.”
Would also appreciate any links/citations/quotes (if personal but sharable communications) on these.
These are all things I’ve said or done due to high estimate of philosophical difficulty, but not (or rarely) seen among academic philosophers, at least from my casual observation from outside academia. It’s also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that for example you don’t think philosophers should often say or do these things), which would also be interesting to know.
Another academic philosopher, directed here by @Simon Goldstein. Hello Wei!
It’s not common to switch entirely to metaphilosophy, but I think lots of us get more interested in the foundations and methodology of at least our chosen subfields as we gain experience, see where progress is(n’t) being made, start noticing deep disagreements about the quality of different kinds of work, and so on. It seems fair to describe this as awakening to a need for better tools and a greater understanding of methods. I recently wrote a paper about the methodology of one of my research areas, philosophy of mathematical practice, for pretty much these reasons.
Current LLMs are pretty awful at discussing the recent philosophy literature, so I think anyone who’d like AI tools to serve as useful research assistants would be happy to see at least some improvement here! I’m personally also excited about the prospects of using language models with bigger context windows for better corpus analysis work in empirical and practice-oriented parts of philosophy.
I basically agree with Simon on this.
I don’t think this is uncommon. You might not see these reversals in print often, because nobody wants to publish and few people want to read a paper that just says “I retract my previous claims and no longer have a confident positive view to offer”. But my sense is that philosophers often give up on projects because the problems are piling up and they no longer see an appealing way forward. Sometimes this happens more publicly. Hilary Putnam, one of the most influential philosophers of the later 20th century, was famous for changing his mind about scientific realism and other basic metaphysical issues. Wesley Salmon gave up his influential “mark transmission” account of causal explanation due to counterexamples raised by Kitcher (as you can read here). It would be easy enough to find more examples.
Great questions. Sadly, I don’t have any really good answers for you.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I do not, except for the end of Superintelligence
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.
Sadly, I don’t have any really good answers for you.
Thanks, it’s actually very interesting and important information.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I’ve noticed (and stated in the OP) that normative ethics seems to be an exception where it’s common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because e.g. we’ve been trying to solve them for centuries without coming up with broadly convincing solutions), and there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn’t seem to be the case.
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring out which, because if “calling for an AI pause”, what is the opportunity cost (it seems easy enough to write or sign an open letter), and if “AI pause itself”, “seems useless” contradicts “would love”. In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views of this on LW (or their preferred venue), and share your own views?
FTR I’d probably be up for helping out logistically with such an open letter (e.g. making the website and any other parts of it). I previously made this open letter.
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.
I think there’s a steady stream of philosophers getting interested in various questions in metaphilosophy.
Thanks for this info and the references. I guess by “metaphilosophy” I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there’s a preferred term for the concept I’m pointing at.) My reasoning being that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen.
I think you (and other philosophers) may be too certain that a pause won’t happen, but I’m not sure I can convince you (at least not easily). What about calling for it in a low cost way, e.g., instead of doing something high profile like an open letter (with perceived high opportunity costs), just write a blog post or even a tweet saying that you wish for an AI pause, because …? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if by keeping silent, you’re helping to keep society in a highly suboptimal equilibrium?
I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the “opportunity cost” issue, but I’m not sure if you’re open to that kind of argument. I think they should have some weight given moral uncertainty.
Hm. I think modern academic philosophy is a raging shitshow, but I thought philosophy on LW was quite good. I hadn’t been a regular LW user until a couple of years ago, and the philosophical takes here, particularly Eliezer’s, converge with my own conclusions after a half lifetime of looking at philosophical questions through the lens of science, particularly neuroscience and psychology.
So: what do you see as the limitations in LW/Yudkowskian philosophy? Perhaps I’ve overlooked them.
I am currently skeptical that we need better philosophy for good AGI outcomes, vs. better practical work on technical AGI alignment (a category that barely exists) and PR work to put the likely personal-intent-aligned AGI into the hands of people who give half a crap about understanding or implementing ethics. Deciding on the long term future will be a matter of a long contemplation if we get AGI into good hands. We should decide if that logic is right, and if so, plan the victory party after we’ve won the war.
I did read your metaphilosophy post and remain unconvinced that there’s something big the rest of us are missing.
I’m happy to be corrected (I love becoming less wrong, and I’m aware of many of my biases that might prevent it):
Here’s how it currently looks to me: Ethics are ultimately a matter of preference, the rest is game theory and science (including the science of human preferences). Philosophical questions boil down to scientific questions in most cases, so epistemology is metaphilosophy for the most part.
Change my mind! Seriously, I’ll listen. It’s been years since I’ve thought about philosophy hard.
I was just reading Daniel Dennett’s memoir for no reason in particular; it had some interesting glimpses into how professional philosophers actually practice philosophy. Like I guess there’s a thing where one person reads their paper (word-for-word!) and then someone else is the designated criticizer? I forget the details. Extremely different from my experience in physics academia though!!
(Obviously, reading that memoir is probably not the most time-efficient way to learn about the day-to-day practice of academic philosophy.)
(Oh, there was another funny anecdote in the memoir where the American professional philosopher association basically had a consensus against some school of philosophy, and everyone was putting it behind them and moving on, but then there was a rebellion where the people who still liked that school of philosophy did a hostile takeover of the association’s leadership!)
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
A non-ethics example that jumps to my mind is David Chalmers on the Hard Problem of Consciousness here: “So if I’m giving my overall credences, I’m going to give, 10% to illusionism, 30% to panpsychism, 30% to dualism, and maybe the other 30% to, I don’t know what else could be true, but maybe there’s something else out there.” That’s the only example I can think of but I read very very little philosophy.
I genuinely don’t know what you want elaboration of. Reacts are nice for what they are, but saying something out loud about what you want to hear more about / what’s confusing / what you did and didn’t understand/agree with, is more helpful.
Re/ “to whom not...”, I’m asking Wei: what groups of people would not be described by the list of 6 “underestimating the difficulty of philosophy” things? It seems to me that broadly, EAs and “AI alignment” people tend to favor somewhat too concrete touchpoints like “well, suppressing revolts in the past has gone like such and such, so we should try to do similar for AGI”. And broadly they don’t credit an abstract argument about why something won’t work, or would only work given substantial further philosophical insight.
Re/ “don’t think thinking …”, well, if I say “LLMs basically don’t think”, they’re like “sure it does, I can keep prompting it and it says more things, and I can even put that in a scaffold” or “what concrete behavior can you point to that it can’t do”. Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is, isn’t clear, but That thinking is should be presumed, pending a forceful philosophical conceptual replacement!
That is, in fact, a helpful elaboration! When you said
Most people who “work on AI alignment” don’t even think that thinking is a thing.
my leading hypotheses for what you could mean were:
Using thought, as a tool, has not occurred to most such people
Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.
Now, instead, my leading hypothesis is that you mean:
Most such people are failing to notice that there’s an important process, called “thinking”, which humans do but LLMs “basically” don’t do.
This is a bunch more precise! For one, it mentions AIs at all.
As my reacts hopefully implied, this is exactly the kind of clarification I needed—thanks!
Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is, isn’t clear, but That thinking is should be presumed, pending a forceful philosophical conceptual replacement!
Sure, but you’re not preaching to the choir at that point. So surely the next step in that particular dance is to stick a knife in the crack and twist?
That is -
“OK, buddy:
Here’s property P (and if you’re good, Q and R and...) that [would have to]/[is/are obviously natural and desirable to]/[is/are pretty clearly a critical part if you want to] characterize ‘thought’ or ‘reasoning’ as distinct from whatever it is LLMs do when they read their own notes as part of a new prompt and keep chewing them up and spitting the result back as part of the new prompt for itself to read.
Here’s thing T (and if you’re good, U and V and...) that an LLM cannot actually do, even in principle, which would be trivially easy for (say) an uploaded (and sane, functional, reasonably intelligent) human H to do, even if H is denied (almost?) all of their previously consolidated memories and just working from some basic procedural memory and whatever Magical thing this ‘thinking’/‘reasoning’ thing is.”
And if neither you nor anyone else can do either of those things… maybe it’s time to give up and say that this ‘thinking’/‘reasoning’ thing is just philosophically confused? I don’t think that that’s where we’re headed, but I find it important to explicitly acknowledge the possibility; I don’t deal in more than one epiphenomenon at a time and I’m partial to Platonism already. So if this ‘reasoning’ thing isn’t meaningfully distinguishable in some observable way from what LLMs do, why shouldn’t I simply give in?
I’ve had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I’ve yet to get a positive answer (or find someone expressing this sentiment independently):
Among my first reactions upon hearing “artificial superintelligence” were “I can finally get answers to my favorite philosophical problems” followed by “How do I make sure the ASI actually answers them correctly?”
Anyone else reacted like this?
This aside, there are some people around LW/rationality who seem more cautious/modest/self-critical about proposing new philosophical solutions, like MIRI’s former Agent Foundations team, but perhaps partly as a result of that, they’re now out of a job!
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done “so much more, much more intently, and much sooner”?
The type of fundamental problem that proper speculative philosophy is supposed to solve is the sort where streetlighting doesn’t work (or isn’t working, or isn’t working fast enough). But nearly all of the alignment field after like 2004 was still basically streetlighting. It was maybe a reasonable thing to have some hope in prospectively, but retrospectively it was too much investment in streetlighting, and retrospectively I can make arguments about why one should have maybe guessed that at the time. By 2018 IIRC, or certainly by 2019, I was vociferously arguing for that in AF team meetings—but the rest of the team either disagreed with me or didn’t understand me, and on my own I’m just not that good a thinker, and I didn’t find anyone else to try it with. I think they have good thoughts, but are nevertheless mostly streetlighting—i.e. not trying to take step after step of thinking at the level of speculative philosophy AND aimed at getting the understanding needed for alignment.
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn’t convince them (or someone else) to join you. To me this doesn’t clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)
If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to “see in the dark” to pick out a more promising search area or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?
(Also perfectly fine to stop here if you want. I’m pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)
Ok, so, there’s this thing about AGI killing everyone. And there’s this idea of avoiding that by making AGI that’s useful like an AGI but doesn’t kill everyone and does stuff we like. And you say you’re working on that, or want to work on that. And what you’re doing day to day is {some math thing, some programming thing, something about decision theory, …}. What is the connection between these things?
and then you listen to what they say, and reask the question and interrogate their answers, IME what it very often grounds out into is something like:
Well, I don’t know what to do to make aligned AI. But it seems like X ∈ {ontology, decision, preference function, NN latent space, logical uncertainty, reasoning under uncertainty, training procedures, negotiation, coordination, interoperability, planning, …} is somehow relevant.
And, I have a formalized version of some small aspect of X which is mathematically interesting / philosophically intriguing / amenable to testing with a program, and which seems like it’s kinda related to X writ large. So what I’m going to do, is I’m going to tinker with this formalized version for a week/month/year, and then I’m going to zoom out and think about how this relates to X, and what I have and haven’t learned, and so on.
This is a good strategy because this is how all mathematical / scientific / technological progress is made: you start with stuff you know; you expand outwards by following veins of interest, tractability, and generality/power; you keep an eye roughly towards broader goals by selecting the broad region you’re in; and you build outward. What we see historically is that this process tends to lead us to think about the central / key / important / difficult / general problems—such problems show up everywhere, so we convergently will come to address them in due time. By mostly sticking, in our day-to-day work, to things that are relatively more concrete and tractable—though continually pushing and building toward difficult things—we make forward progress, sharpen our skills, and become familiar with the landscape of concepts and questions.
So I would summarize that position as endorsing streetlighting, in a very broad sense that encompasses most math / science / technology. And this position is largely correct! My claim is that
this is probably too slow for making Friendly AI, and
maybe one could go faster by trying to more directly cleave to the core philosophical problems.
(But note that, while that essay frames things as “a proposed solution”, the solution is barely anything—more like a few guesses at pieces of methodology—and the main point is the discussion of the problem; maybe a writing mistake.)
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
[2.] maybe one could go faster by trying to more directly cleave to the core philosophical problems.
...
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
OK but what would this even look like?
Toss away anything amenable to testing and direct empirical analysis; it’s all too concrete and model-dependent.
Toss away mathsy proofsy approaches; they’re all too formalized and over-rigid and can only prove things from starting assumptions we haven’t got yet and maybe won’t think of in time.
Toss away basically all settled philosophy, too; if there were answers to be had there rather than a few passages which ask correct questions, the Vienna Circle would have solved alignment for us.
What’s left? And what causes it to hang together? And what causes it not to vanish up its own ungrounded self-reference?
What makes you think there are any such ‘answers’, renderable in a form that you could identify?
And even if they do exist, why do you think a human being could fully grasp the explanation in finite time?
Edit: It seems quite possible that even the simplest such ‘answers’ could require many years of full time effort to understand, putting it beyond most if not all human memory capacity. i.e. By the end even those who ‘learned’ it will have forgotten many parts near the beginning.
(Upvoted since your questions seem reasonable and I’m not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with our understanding of mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI’s arguments.
We increase our own intelligence and philosophical competence until we can verify the ASI’s reasoning ourselves.
I blame science, math, engineering, entrepreneurship. Philosophy is the practice of the esoteric method, meaning it can’t be made truly legible for very long stretches of investigation. This results in accumulation of anti-epistemic hazards, which science doesn’t particularly need to have tools for dealing with, because it can filter its reasoning through frequent transitions into legibility.
Philosophy can’t rely on such filtering through legibility; it has to maintain sanity the hard way. But as philosophy enviously looks at the more successful endeavors of science, it doesn’t see respect for such methods of maintaining sanity in its reasoning; instead it sees that merely moving fast and breaking things works very well. And so the enthusiasm for developing those methods wanes; instead philosophy remains content with the object-level questions that investigate particular truths, rather than methods for getting better at telling which cognitive algorithms can more robustly arrive at truths (rationality, metaphilosophy).
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it’s possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading to a collapse of civilization, but embryo selection for IQ could prevent that, and even if collapse happens, natural selection would start increasing fertility and intelligence of humans again, so future smarter humans should be able to rebuild civilization and restart technological progress.
Added: Here’s an example to illustrate my model. Assume a normally distributed population with an average IQ of 100, and that we need a certain number of people with IQ>130 to achieve AGI. If the total population were to halve, then to get the same absolute number of IQ>130 people as today, average IQ would have to increase by 4.5, and if the population were to become 1/10 of the original, average IQ would have to increase by 18.75.
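For what it’s worth, the arithmetic here is easy to check numerically. A minimal sketch (my own check, not part of the original comment), assuming IQ is normally distributed with a standard deviation of 15 and a fixed 130 cutoff:

```python
# Illustrative check of the claim above (assumes SD = 15; small differences
# from the quoted +4.5 / +18.75 figures come from rounding).
from scipy.stats import norm

SD = 15
THRESHOLD = 130
baseline_fraction = norm.sf(THRESHOLD, loc=100, scale=SD)  # ~2.3% of an IQ-100 population

def required_mean(population_ratio):
    """Mean IQ needed so a population shrunk to `population_ratio` of its
    original size still has the same absolute number of people above THRESHOLD."""
    needed_fraction = baseline_fraction / population_ratio
    return THRESHOLD - norm.isf(needed_fraction) * SD

for ratio in (0.5, 0.1):
    print(f"population x{ratio}: mean IQ must rise to ~{required_mean(ratio):.1f}")
# Prints roughly 104.6 and 118.8, i.e. increases of about 4.6 and 18.8 points.
```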
I find the idea interesting: To achieve a certain value of “total genius”, we either need a large population with a small fraction of geniuses, or a small population with a large fraction of geniuses.
(A third option is a small population with a small fraction of geniuses… and it takes a lot of time. The geniuses read each other’s books, rather than talk to each other directly. I think it was like this in the past. Very inefficient, because the information transfer by reading books is one-sided; does not allow collaboration in real time.)
I wonder how the heritability of IQ works, versus the reversion to the mean. Despite Pol Pot’s dystopian project, the average IQ in Cambodia seems to be… average. What would happen to a country where let’s say half of the children are produced by artificial insemination, and half of the sperm comes from fathers with IQ 130 and above? If the mother is average, the child is likely to be an average between 100 and 130, so 115. On one hand, nothing exceptional; on the other hand, if the baseline is now slightly higher, then the next generation… and here comes the question how exactly the reversion to the mean works, and whether the constant injections of IQ 130 genes in the population could outrun it.
Reversion to the mean happens because the phenotype is an imperfect proxy for the genotype, so if you select a person with phenotypic IQ of 130, maybe their genotypic IQ is only 124, and therefore their offspring with an IQ 100 person would only be IQ 112 in expectation.
However, this expectation applies to both the offspring’s genotypic and phenotypic IQ, so this is the only regression to the mean you’re going to see; it’s not going to regress further down the line.
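To make the numbers explicit, here is a tiny sketch (my own illustration; the heritability value of 0.8 is an assumption chosen just to reproduce the 130 → 124 → 112 example):

```python
# Toy breeder's-equation-style calculation (illustrative only; h2 = 0.8 is assumed).
H2 = 0.8      # assumed narrow-sense heritability of IQ
MEAN = 100    # population mean

def expected_genotypic(phenotypic_iq):
    # Expected additive-genetic value given an observed phenotype.
    return MEAN + H2 * (phenotypic_iq - MEAN)

def expected_offspring(phenotype_a, phenotype_b):
    # Offspring's expected genotype (and hence expected phenotype) is the
    # average of the parents' expected genotypic values.
    return (expected_genotypic(phenotype_a) + expected_genotypic(phenotype_b)) / 2

print(expected_genotypic(130))        # 124.0 -- the "regression" happens here, once
print(expected_offspring(130, 100))   # 112.0 -- and this is already the offspring's
                                      # expected genotype, so no further regression
```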
I wrote Smart Losers a long time ago, trying to understand/explain certain human phenomena. But the model could potentially be useful for understanding (certain aspects of) human-AI interactions as well.
Possibly relevant anecdote: Once I was with a group of people who tried various psychological experiments. That day, the organizers proposed that we play iterated Prisoner’s Dilemma. I was like “yay, I know the winning strategy, this will be so easy!”
I lost. Almost everyone always defected against me; there wasn’t much I could do to get points comparable to other people who mostly cooperated with each other.
After the game, I asked why. (During the game, we were not allowed to communicate, just to write our moves.) The typical answer was something like: “well, you are obviously very smart, so no matter what I do, you will certainly find a way to win against me, so my best option is to play it safe and always defect, to avoid the worst outcome”.
I am not even sure if I should be angry at them. I suppose that in real life, when you have about average intelligence, “don’t trust people visibly smarter than you” is probably a good strategy, on average, because there are just too many clever scammers walking around. At the same time I feel hurt, because I am a natural altruist and cooperator, so this feels extremely unfair, and a loss for both sides.
(There were other situations in my life where the same pattern probably also applied, but most of the time, you just don’t know why other people do whatever they do. This time I was told their reasoning explicitly.)
What (human or not) phenomena do you think are well explained by this model? I tried to think of any for 5 minutes and the best I came up with was the strong egalitarianism among hunter gatherers. I don’t actually know that much about hunter gatherers though. In the modern world something where “high IQ” people are doing worse is sex, but it doesn’t seem to fit your model.
Human-human: Various historical and current episodes of smarter-than-average populations being persecuted or discriminated against, such as intellectuals, “capitalists” (i.e., people labeled as such), certain ethnic groups. (I’m unsure my model is actually a good explanation of such phenomena, but this is mainly what I was trying to explain.)
Human-AI: Many people being reluctant to believe that it’s a good idea to build unaligned artificial superintelligences and then constrain them with a system of laws and/or social norms (which some people like Robin Hanson and Matthew Barnett have proposed). Aside from the issue of violent overthrow, any such system is bound to have loopholes, which the ASI will be more adept at exploiting, yet this adeptness potentially causes the ASI to be worse off (less likely to exist in the first place), similar to what happens in my model.
Are humans fundamentally good or evil? (By “evil” I mean something like “willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)” and by “good” I mean “most people won’t do that because they terminally care about others”.) People say “power corrupts”, but why isn’t “power reveals” equally or more true? Looking at some relevant history (people thinking Mao Zedong was sincerely idealistic in his youth, early Chinese Communist Party looked genuine about wanting to learn democracy and freedom from the West, subsequent massive abuses of power by Mao/CCP lasting to today), it’s hard to escape the conclusion that altruism is merely a mask that evolution made humans wear in a context-dependent way, to be discarded when opportune (e.g., when one has secured enough power that altruism is no longer very useful).
After writing the above, I was reminded of @Matthew Barnett’s AI alignment shouldn’t be conflated with AI moral achievement, which is perhaps the closest previous discussion around here. (Also related are my previous writings about “human safety” although they still used the “power corrupts” framing.) Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
At the time, I replied to him, “I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general.” I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough, that as they got richer and more powerful, their altruistic values would have more and more influence. In the few months since then, I’ve become more worried, perhaps due to learning more about Chinese history and politics...
Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
The reason I said “most humans are not evil” is because I honestly don’t think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).
By contrast, I focused on consumer behavior because the majority of the world’s economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.
Focusing on consumer preferences is a natural thing to do if you want to capture intuitively “what humans are doing with their wealth”, at least in our current world. Before focusing on something else by default—such as moral preferences—I’d want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.
You mention one such argument along these lines:
I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I’d like to see one before I abandon my focus on “everyday (e.g. consumer) behavior”.
Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal.
Ok, I guess we just define/use it differently. I think most people we think of as “evil” probably justify inflicting harm to others as instrumental to some “greater good”, or are doing it to gain or maintain power, not because they value it for its own sake. I mean if someone killed thousands of people in order to maintain their grip on power, I think we’d call them “evil” and not just “selfish”?
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer.
I’m pretty sure that billionaires consume much less as percent of their income, compared to the average person. EA funding comes disproportionately from billionaires, AFAIK. I personally spend a lot more time/effort on altruistic causes, compared to if I was poorer. (Not donating much though for a number of reasons.)
For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis?
I’m thinking that we just haven’t reached that inflection point yet, where most people run out of things to spend selfishly on (like many billionaires have, and like I have to a lesser extent). As I mentioned in my reply to your post, a large part of my view comes from not being able to imagine what people would spend selfishly on, if each person “owned” something like a significant fraction of a solar system. Why couldn’t 99% of their selfish desires be met with <1% of their resources? If you had a plausible story you could tell about this, that would probably change my mind a lot. One thing I do worry about is status symbols / positional goods. I tend to view that as a separate issue from “selfish consumption” but maybe you don’t?
I like the insight regarding power corrupting or revealing. I think perhaps both might be true and, if so, we should keep both lines of thought in mind when thinking about these types of questions.
My general view is that most people are generally good when you’re talking about individual interactions. I’m less confident in that when one brings in the in-group/out-of-group aspects. I just am not sure how to integrate all that into a general view or principle about human nature.
A line I heard in some cheesy B-grade horror movie relates to this question of personal nature and the idea that we all have competing good and bad wolves inside. One of the characters asks which wolf is stronger, the good wolf or the bad wolf. The answer was “Which do you feed the most?”
My model is that the concept of “morality” is a fiction which has 4 generators that are real:
People have empathy, which means they intrinsically care about other people (and sufficiently person-like entities), but, mostly about those in their social vicinity. Also, different people have different strength of empathy, a minority might have virtually none.
Superrational cooperation is something that people understand intuitively to some degree. Obviously, a minority of people understand it on System 2 level as well.
There is something virtue-ethics-like which I find in my own preferences, along the lines of “some things I would prefer not to do, not because of their consequences, but because I don’t want to be the kind of person who would do that”. However, I expect different people to differ in this regard.
The cultural standards of morality, which it might be selfishly beneficial to go along with, including lying to yourself that you’re doing it for non-selfish reasons. Which, as you say, becomes irrelevant once you secure enough power. This is a sort of self-deception which people are intuitively skilled at.
I don’t think altruism is evolutionarily connected to power as you describe. Caesar didn’t come to power by being better at altruism, but by being better at coordinating violence. For a more general example, the Greek and other myths don’t give many examples of compassion (though they give many other human values); it seems the modern form of compassion only appeared with Jesus, which is too recent for any evolutionary explanation.
So it’s possible that the little we got of altruism and other nice things are merely lucky memes. Not even a necessary adaptation, but more like a cultural peacock’s tail, which appeared randomly and might fix itself or not. While our fundamental nature remains that of other living creatures, who eat each other without caring much.
I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
Yeah, that seems to agree with my pessimistic view—that we are selfish animals, except we have culture, and some cultures accidentally contain altruism. So the answer to your question “are humans fundamentally good or evil?” is “humans are fundamentally evil, and only accidentally sometimes good”.
I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
I think that there is a version of an altruistic pursuit where one will, by default, “reduce his power.” I think this scenario happens when, in the process of attempting to do good, one exposes himself more to unintended consequences. The person who sacrifices will reduce his ability to exercise power, but he may regain or supersede such loss if the tribe agrees with his rationale for such sacrifice.
Just because it was not among the organizing principles of any of the literate societies before Jesus does not mean it is not part of the human mental architecture.
“willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)”
Yes, obviously.
The vast majority of people would inflict huge amounts of disutility on others if they thought they could get away with it and benefitted from it.
What then prevents humans from being more terrible to each other? Presumably, if the vast majority of people are like this, and they know that the vast majority of others are also like this, up to common knowledge, I don’t see how you’d get a stable society in which people aren’t usually screwing each other a giant amount.
There are several levels in which humans can be bad or evil:
1) Doing bad things because they believe them to be good
2) Doing bad things while not caring whether they are bad or not
3) Doing bad things because they believe them to be bad (Kant calls this “devilish”)
I guess when humans are bad, they usually do 1). Even Hitler may have genuinely thought he was doing the morally right thing.
Humans also sometimes do 2), for minor things. But rarely if the anticipated bad consequences are substantial. People who consistently act according to 2) are called psychopaths. They have no inherent empathy for other people. Most humans are not psychopathic.
Humans don’t do 3), they don’t act evil for the sake of it. They aren’t devils.
This was written with AI assistance, and my workflow was to give the general theme to AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and do small manual changes. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.
Thanks! I was unsure whether to post this, because I’m worried about contributing to a future where AI generated content overwhelms everything, and I wasn’t sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)
Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I’ll just describe what I did in more detail. Started with
Can you write an outline for an episode of STTNG in which the crew becomes suspicious of various logical inconsistencies in the universe (including backups being taboo) and starts questioning the nature of their reality?
which resulted in this initial outline:
I. Teaser
The Enterprise responds to a distress call from a Federation outpost
Upon arrival, they find the outpost abandoned, with only a mysterious device left behind
II. Act One
Data activates the device, which emits a pulse of energy
Crew members begin noticing small inconsistencies in their daily routines
Geordi questions why they don’t use replicators to solve resource shortages on colonies
III. Act Two
More crew members report strange observations
Dr. Crusher wonders why they don’t use transporter technology to cure diseases or injuries
Worf ponders the improbability of so many humanoid species evolving independently
Picard becomes concerned about the crew’s unusual behavior
IV. Act Three
Riker raises the question of why they don’t use transporters for backing up crew members
The senior staff meets to discuss the mounting inconsistencies
Data theorizes that the mysterious device has enhanced their critical thinking abilities
Picard orders the ship to return to the outpost for further investigation
V. Act Four
At the outpost, they discover records of experiments in “cognitive enhancement”
The crew starts questioning the nature of their reality
Some crew members experience existential crises
The ship’s computer begins to malfunction as the crew’s doubts intensify
VI. Act Five
Data determines that their enhanced cognition is causing a feedback loop in the ship’s systems
The crew must find a way to deactivate the device and return to normal
Philosophical debate ensues about whether they should retain their newfound awareness
Picard makes the difficult decision to deactivate the device
VII. Epilogue
The crew returns to normal, but with a lingering sense of unease
Picard records a captain’s log reflecting on the experience and the nature of their reality
Final scene shows a subtle glitch in the ship’s operations, leaving a sense of ambiguity
Then the main things I recall doing:
Ask AI to add parts about audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
Add second part of conversation between Guinan and Ensign, with both of them having second thoughts about what Guinan initially said.
Add part about suspending ship computer to sandbox, to make clear it wasn’t harmed.
Ask AI for additional realizations/inferences the crew might have, and add selected ones to the story.
Ask AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
Pointing out a bunch of sentences/paragraphs that didn’t make as much sense as they could, with AI generally able to improve them a lot in one try.
Also tried to get AI to find issues with the story on its own, but that didn’t work very well.
Manually make some edits, like changing a word here and there, deleting parts that were repetitive, moving parts around to make the story flow better.
So, it wasn’t your idea for the crew to start theorizing that they were fictional characters? Nor was it your idea for them to theorize that they were fictional characters in a story written by AI?
Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:
Please read this outline and suggest some ways to improve the script, especially to make it more philosophically and intellectually interesting.
And as part of its answer it suggested:
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Then I wrote
What about exploring their previous lack of critical thinking and implications of that both for their own universe and the larger universe they’re part of?
Which made it focus more on the fiction hypothesis.
The part about the story being written by AI was its own idea. No hints from me as far as I can remember.
ETA: Here’s the whole answer to my “make it more philosophically and intellectually interesting” prompt:
This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:
Expand on the nature of reality:
Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Explore the implications of living in a simulated reality and how it affects free will and personal identity.
Deepen the existential crisis:
Have characters grapple with the meaning of their existence if their reality is artificial.
Explore how this realization affects their relationships, duties, and sense of purpose.
Include scenes where characters contemplate whether their memories and experiences are “real.”
Introduce a moral dilemma:
If they discover their reality is simulated, should they attempt to contact their “creators”?
Explore the ethics of continuing to exist in a simulated world vs. attempting to “wake up” to a potentially harsher reality.
Have characters debate the morality of creating simulated universes themselves.
Expand on the scientific implications:
Have Data and Geordi theorize about the computational power required to simulate their universe.
Explore how the laws of physics in their universe might be “programmed” and if they can be altered.
Discuss the anthropic principle and fine-tuning of universal constants.
Incorporate elements of epistemology:
Have characters question how they can know anything for certain given their potentially simulated existence.
Explore concepts like Descartes’ “I think, therefore I am” in the context of artificial consciousness.
Discuss the nature of consciousness and whether it can be truly simulated.
Add a temporal element:
Introduce time dilation or time loops as the crew’s enhanced cognition allows them to perceive time differently.
Explore how their perception of time affects their understanding of causality and free will.
Enhance the ending:
Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
Consider a more ambiguous ending that leaves the question of their reality’s nature unresolved.
Include references to philosophical works:
Have characters discuss relevant philosophical concepts, such as Plato’s Allegory of the Cave, Berkeley’s idealism, or Putnam’s brain in a vat thought experiment.
Include a scene where the crew debates different philosophical perspectives on the nature of reality.
Explore the ethical implications of their technology:
Expand on Riker’s question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
Delve deeper into the ethics of using replicators and how it affects societal structures and values.
Incorporate a meta-narrative element:
Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
Explore the idea of narrative causality and how it might shape their reality.
These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge/familiarity of the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
But is this due to limitations of RLHF training, or something else?
RLHF with humans might also soon get obsoleted by things like online DPO where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even weak-to-strong setting won’t necessarily be important, if tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (“constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what’s going on as a result of choosing particular instructions.
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned) such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time from an experienced software engineer, which might suffice for good long context supervision without necessarily needing any fancy scalable oversight approaches. (That said, probably people will use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days which isn’t too bad of a delay. (If we’re fine with this labels trickling in for online training, the delay could be even smaller.)
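To make the two-head idea concrete, here is a rough sketch of how such a reward model could be set up (my own illustrative reading of the proposal, with made-up names and shapes; a real reward model would typically sit on an LM trunk and use a pairwise preference loss rather than MSE on scalar labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadRewardModel(nn.Module):
    """Sketch: one head fit on plentiful weak labels, a second head fit on the
    weak->strong residual using the scarce high-quality labels."""

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                    # maps inputs to pooled hidden states
        self.weak_head = nn.Linear(hidden_dim, 1)   # trained on ~100,000s of weak labels
        self.delta_head = nn.Linear(hidden_dim, 1)  # trained on ~1,000 strong labels

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.weak_head(h).squeeze(-1), self.delta_head(h).squeeze(-1)

    def reward(self, x: torch.Tensor):
        weak, delta = self(x)
        return weak + delta                         # corrected reward used downstream

def training_losses(model, x, weak_label=None, strong_label=None):
    weak_pred, delta_pred = model(x)
    losses = {}
    if weak_label is not None:
        losses["weak"] = F.mse_loss(weak_pred, weak_label)
    if strong_label is not None:
        # The second head only has to learn the difference between the weak and
        # strong labels, so the scarce strong labels carry less of the burden.
        losses["delta"] = F.mse_loss(delta_pred, strong_label - weak_pred.detach())
    return losses

# Toy usage with a stand-in backbone:
model = TwoHeadRewardModel(nn.Sequential(nn.Linear(16, 32), nn.ReLU()), hidden_dim=32)
x = torch.randn(4, 16)
print(training_losses(model, x, weak_label=torch.randn(4), strong_label=torch.randn(4)))
```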
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
Oops, fixed.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Even worse, apparently the whole Superalignment team has been disbanded.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
I’m thinking that the most ethical (morally least risky) way to “insure” against a scenario in which AI takes off and property/wealth still matters is to buy long-dated, far out-of-the-money S&P 500 calls. (The longest-dated and farthest out-of-the-money ones available seem to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
My reasoning here is that I don’t want to provide capital to AI industries or suppliers because that seems wrong given what I judge to be the high x-risk their activities are causing (otherwise I’d directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my/moral values. Suggestions are welcome for better/alternative ways to do this.
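For concreteness, the payoff arithmetic behind the quoted numbers, in index points (a rough sketch; it ignores the SPX contract multiplier, fees, taxes, and the possibility of selling before expiry):

```python
# Payoff of a long call at expiry, in index points; premium/strike are the
# figures quoted above, everything else (multiplier, fees) is ignored.
def long_call_payoff(index_at_expiry: float, strike: float) -> float:
    return max(index_at_expiry - strike, 0.0)

premium = 78.0       # quoted cost of one Dec 2029 10000-strike SPX call
strike = 10_000.0

for spx in (8_000, 10_000, 15_000, 20_000, 40_000):
    payoff = long_call_payoff(spx, strike)
    print(f"SPX at {spx:>6,}: payoff {payoff:>8,.0f} points, ~{payoff / premium:6.1f}x premium")
# At 20000 the call pays 10000 points (~128x the premium); at or below the
# 10000 strike it expires worthless, so the maximum loss is the premium paid.
```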
This probably does help capitalize AI companies a little bit, demand for call options will create demand for the underlying. This is probably a relatively small effect (?), but I’m not confident in my ability to estimate this at all.
It doesn’t differentially help capitalize them compared to everything else though, right? (Especially since some of them are private.)
Do these options have a chance to default / are the sellers stable enough?
Default seems unlikely, unless the market moves very quickly, since anyone pursuing this strategy is likely to be very small compared to the market for the S&P 500.
(Also consider that these pay out in a scenario where the world gets much richer — in contrast to e.g. Michael Burry’s “Big Short” swaps, which paid out in a scenario where the market was way down — so you’re just skimming a little off the huge profits that others are making, rather than trying to get them to pay you at the same time they’re realizing other losses.)
It seems hard to buy AI companies at the moment. The only way is to buy tech giants like Microsoft, Google, and Nvidia, which are already valued very highly—it seems like it’s somewhat priced in. It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
It would seem one would want to buy Nasdaq rather than SPX? On the other hand, maybe most tech companies will be wiped out by AI—it’s the world of atoms that would gain relative value.
I’m also uncertain, and it’s another reason for going long a broad index instead. I would go even broader than the S&P 500 if I could, but nothing else has option chains going out to 2029.
I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia (including the most influential ones so far UDT, UDASSA, b-money). Not really complaining, as they’re already more popular than I had expected (Holden Karnofsky talked extensively about UDASSA on an 80,000 Hour podcast, which surprised me), it just seems strange that the popularity stops right at academia’s door. (I think almost no philosophy professor, including ones connected with rationalists/EA, has talked positively about any of my philosophical ideas? And b-money languished for a decade gathering just a single citation in academic literature, until Satoshi reinvented the idea, but outside academia!)
Clearly academia has some blind spots, but how big? Do I just have a knack for finding ideas that academia hates, or are the blind spots actually enormous?
I think the main reason why UDT is not discussed in academia is that it is not a sufficiently rigorous proposal, as well as there not being a published paper on it. Hilary Greaves says the following in this 80k episode:
I also think it is unclear to what extent UDT and updateless are different from existing ideas in academia that are prima facie similar, like McClennen’s (1990) resolute choice and Meacham’s (2010, §4.2) cohesive decision theory.[1] Resolute choice in particular has been discussed in a lot of detail, and for a long time (see the citations of McClennen’s book). (And, FWIW, my sense is that most philosophers think that resolute choice is irrational and/or doesn’t make sense, at least if it is cashed out as a decision rule based on preference.)
It also doesn’t help that it is unclear what the difference between FDT and UDT is supposed to be.
(If UDT is supposed to be an LDT of some sort, then you might want to check out Spohn’s (2012)[2] version of CDT, Fisher’s (n.d) disposition-based decision theory, and Poellinger’s (2013) discussion of Spohn’s theory, for ideas in academia that are similar to the LDT-part of the theory. And then there is also Schwarz’ critique of FDT, which would then also apply to UDT, at least partially.)
My own take, using the terminology listed here, is that the causalist version of Meacham’s cohesive decision theory is basically “updateless CDT”, that the evidentialist version is basically “updateless EDT”, and that a Spohn-CDT version of cohesive decision theory is basically “U(C)DT/F(C)DT”. I also think that resolute choice is much more permissive than e.g. cohesive decision theory and updatelessness. As a decision rule, it doesn’t recommend anything close to “maximizing EU relative to your prior”. Instead, it just states that (i) what you prefer ex ante in a dynamic choice problem should be the same as what you prefer in the normalised version of the problem, and (ii) you should be dynamically consistent (i.e., the most preferred plan should not change throughout the decision problem).
Note that in the published article, it says that the article was received in 2008.
It may be worth thinking about why proponents of a very popular idea in this community don’t know of its academic analogues, despite them having existed since the early 90s[1] and appearing on the introductory SEP page for dynamic choice.
Academics may in turn ask: clearly LessWrong has some blind spots, but how big?
And it’s not like these have been forgotten; e.g., McClennen’s (1990) work still gets cited regularly.
I don’t think this is fair, because even though the component ideas behind UDT/FDT have academic analogues, it doesn’t look like anyone put them together into a single decision theory formulation in the academic literature, at least prior to MIRI’s “Cheating Death in Damascus” being published. Also, “Cheating Death in Damascus” does cite both Meacham and Spohn (and others), and it seems excusable for me to have forgotten those references, since they were both published after I wrote about UDT and, again, were only component ideas of it; plus I haven’t actively worked on decision theory for several years.
I think Sami’s comment is entirely fair given the language and framing of the original post. It is of course fine to forget about references, but e.g. “I find it curious that none of my ideas have a following in academia or have been reinvented/rediscovered by academia” and “Clearly academia has some blind spots, but how big?” read like you don’t consider it a possibility that you might have re-invented something yourself, and that academics are at fault for not taking up your ideas.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented, or gained much of a following, in academia, at least as of 2019 when Cheating Death in Damascus was published, because the authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus, FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
I think there is nothing surprising about a small community of nerds writing in their spare time having blind spots, but when a large professional community has such blind spots, that’s surprising.
On your first point: as Sami writes, resolute choice is mentioned in the introductory SEP article on dynamic choice (it even has its own section!), as well as in the SEP article on decision theory. And SEP is the first place you go when you want to learn about philosophical topics and find references.
On your second point: as I wrote in my comment above, (i) academics have produced seemingly similar ideas to e.g. updatelessness (well before they were written up on LW) so it is unclear why academics should engage with less rigorous, unpublished proposals that appear to be similar (in other words, I don’t think the phrase “blind spots” is warranted), and (ii) when academics have commented on or engaged with LW DT ideas, they have to my knowledge largely been critical (e.g. see the post by Wolfgang Schwarz I linked above, as well as the quote from Greaves)[1].
Cheating Death in Damascus getting published in the Journal of Philosophy is a notable exception though!
To clarify, by “blind spot” I wasn’t complaining that academia isn’t engaging specifically with posts written up on LW, but more that nobody in academia seems to think that the combination of “updateless+logical” is clearly the most important or promising direction to explore in decision theory.
The reason for the former is that I (and others) have been unable to find a rigorous formulation of it that doesn’t have serious open problems. (I and I guess other decision theory researchers in this community currently think that UDT is more of a relatively promising direction to explore, rather than a good decision theory per se.)
And the reason for the latter is the above, plus my personal distaste for writing/publishing academic papers (which I talked about elsewhere in this thread), plus FDT having been published which seems close enough to me.
Thanks for the references in the rest of your comment. I think I’ve come across Meacham 2010 and Spohn 2012 before, but forgot about them as I haven’t been working actively on decision theory for a while. It does seem that Meacham’s cohesive decision theory is equivalent to updateless EDT/CDT. (BTW, in The Absent-Minded Driver I referenced a 1997 paper that also has an idea similar to updatelessness, although the authors didn’t like it.)
On a quick skim of Spohn 2012 I didn’t see something that looks like LDT or “algorithmic/logical agent ontology” but it’s quite long/dense so I’ll take your word on it for now. Still, it seems like none of the academic papers put all of the pieces together in a single decision theory proposal that’s equivalent to UDT or FDT?
(Please note that UDT as originally described was actually updateless/evidential/logical, not causalist as you wrote in the post that you linked. This has been a historical disagreement between me and Eliezer, wherein I leaned towards evidential and he leans towards causal, although these days I just say that I’m confused and don’t know what to think.)
That’s fair. But what is it then that you expect academics to engage with? How would you describe this research direction, and why do you think it’s interesting and/or important?
To quickly recap the history, people on LW noticed some clear issues with “updating” and “physicalist ontology” of the most popular decision theories at the time (CDT/EDT), and thought that switching to “updatelessness” and “logical/algorithmic ontology” was an obvious improvement. (I was the first person to put the two pieces together in an explicit formulation, but they were already being talked about / hinted at in the community.) Initially people were really excited because the resulting decision theories (UDT/FDT) seemed to solve a lot of open problems in one swoop, but then pretty quickly and over time we noticed more and more problems with UDT/FDT that seem to have no clear fixes.
So we were initially excited but then increasingly puzzled/confused, and I guess I was expecting at least some academics to follow a similar path, either through engagement with LW ideas (why should they be bothered that much by lack of academic publication?), or from independent invention. Instead academia seems to still be in a state similar to LW when I posted UDT, i.e., the ideas are floating in the air separately and nobody has put them together yet? (Or I guess that was the state of academia before FDT was published in an academic journal, so now the situation is more like some outsiders put the pieces together in a formal publication, but still no academic is following a similar path as us.)
I guess it’s also possible that academia sort of foresaw or knew all the problems that we’d eventually find with UDT/FDT and that’s why they didn’t get excited in the first place. I haven’t looked into academic DT literature in years, so you’re probably more familiar with it. Do you know if they’re puzzled/confused by the same problems that we are? Or what are they mostly working on / arguing about these days?
There are many, many interesting questions in decision theory, and “dimensions” along which decision theories can vary, not just the three usually discussed on LessWrong. It’s not clear to me (i) why philosophers should focus on the dimensions you primarily seem to be interested in, or (ii) what is so special about the particular combination you mention (is there some interesting interaction I don’t know about, maybe?). Furthermore, note that most philosophers probably do not share your intuitions: I’m pretty sure most of them would e.g. pay in counterfactual mugging. (And I have not seen a good case for why it would be rational to pay.) I don’t mean to be snarky, but you could just be wrong about what the open problems are.
I wouldn’t say so, no. But I’m not entirely sure if I understand what the open problems are. Reading your list of seven issues, I either (i) don’t understand what you are asking, (ii) disagree with the framing/think the question is misguided, or (iii) think there is an obvious answer (which makes me think that I’m missing something). With that said, I haven’t read all the posts you reference, so perhaps I should read those first.
It would be interesting to get an overview of what these are. Or if that’s too hard to write down, and there are no ready references, what are your own interests in decision theory?
As I mentioned in the previous comment, it happens to solve (or at least seemed like a good step towards solving) a lot of problems I was interested in at the time.
Agreed, but my intuitions don’t seem so unpopular outside academia or so obviously wrong that there should be so few academic philosophers who do share them.
I’m not sure I wouldn’t pay either. I see it as more of an interesting puzzle than having a definitive answer. ETA: Although I’m more certain that we should build AIs that do pay. Is that also unclear to you? (If so why might we not want to build such AIs?)
Yeah, I’m trying to keep an open mind about that. :)
Cool, I’d be interested in any further feedback when you’re ready to give them.
Yeah, that would be too hard. You might want to look at these SEP entries: Decision Theory, Normative Theories of Rational Choice: Expected Utility, Normative Theories of Rational Choice: Rivals to Expected Utility and Causal Decision Theory. To give an example of what I’m interested in, I think it is really important to take into account unawareness and awareness growth (see §5.3 of the first entry listed above) when thinking about how ordinary agents should make decisions. (Also see this post.)
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
On the point about AI (not directly responding to your question, to which I don’t have an answer): I think it’s really important to be clear about whether we are discussing normative, constructive or descriptive decision theory (using Elliott Thornley’s distinction here). For example, the answers to “is updatelessness normatively compelling?”, “should we build an updateless AI?” and “will some agents (e.g. advanced AIs) commit to being updateless?” will most likely come apart (it seems to me). And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Thanks, will look into your references.
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
Yes, missed or forgot about that sentence, sorry.
Thanks.
I was mostly thinking about discussions surrounding what the “correct” decision theory is, whether you should pay in CM, and so on.
Thanks Sylvester! Yep it looks like cohesive decision theory is basically original UDT. Do you know what the state of the art is in terms of philosophical critiques of cohesive decision theory? Any good ones? Any good responses to the critiques?
Cohesive decision theory lacks the logical/algorithmic ontology of UDT and is closer to what we call “updateless EDT/CDT” (the paper itself talks about cohesive versions of both).
Also interested in a response from Sylvester, but I would guess that one of the main critiques is something like Will MacAskill’s Bomb thought experiment, or just intuitions for paying the counterfactual mugger. From my perspective, these do have a point when it comes to humans, since humans seemingly have indexical values, and one way to explain why UDT makes recommendations in these thought experiments that look “bizarre” to many humans is that it assumes away indexical values (via the type signature of its utility function). (It was an implicit and not totally intentional assumption, but it’s unclear how to remove the assumption while retaining nice properties associated with updatelessness.) I’m unsure if indexical values themselves are normative or philosophically justified, and they are probably irrelevant or undesirable when it comes to AIs, but I guess academic philosophers probably take them more for granted and are not as interested in AI (and therefore take a dimmer view on updatelessness/cohesiveness).
But yeah, if there are good critiques/responses aside from these, it would be interesting to learn them.
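(For readers who haven’t seen the thought experiment: below is a toy expected-value calculation of why updateless reasoning pays in counterfactual mugging while updated reasoning balks. The payoff numbers are the conventional illustrative ones, assumed here rather than taken from this thread.)

```python
# Toy counterfactual mugging arithmetic. A predictor flips a fair coin: in one branch
# it asks you for $100; in the other it pays you $10,000, but only if it predicts you
# would have paid in the asking branch. Numbers are the usual illustrative ones.
P_REWARD_BRANCH = 0.5
PAYOUT_IF_PAYER = 10_000   # received in the reward branch, only if you'd pay when asked
COST_OF_PAYING = 100       # handed over in the branch where you're asked to pay

ev_committed_payer = P_REWARD_BRANCH * PAYOUT_IF_PAYER - (1 - P_REWARD_BRANCH) * COST_OF_PAYING
ev_refuser = 0.0
print(ev_committed_payer, ev_refuser)  # 4950.0 vs 0.0
# Ex ante (or for an AI designed in advance), being a payer wins. But conditional on
# already finding yourself in the asked-to-pay branch, paying just loses $100 -- which
# is where updated (indexical) intuitions balk and updateless reasoning pays.
```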
I don’t think cohesive decision theory is being discussed much, but I’m not sure. Perhaps because the theory is mainly used to argue against the claim that “every decision rule will lead agents who can’t bind themselves to disaster” (p. 20, footnote 34) in the paper, and discussion of its independent interest is relegated to a footnote (footnote 34).
OK, thanks. So then the mystery remains why academic philosophy isn’t more interested in this.
I was thinking of writing a short post kinda on this topic (EDIT TO ADD: it’s up! See Some (problematic) aesthetics of what constitutes good work in academia), weaving together:
Holden on academia not answering important questions
This tweet I wrote on the aesthetics of what makes a “good” peer-reviewed psychology paper
Something about the aesthetics of what makes a “good” peer-reviewed AI/ML paper, probably including the anecdote where DeepMind wrote a whole proper academia-friendly ML paper whose upshot was the same as a couple sentences in an old Abram Demski blog post
Something about the aesthetics of what makes a “good” peer-reviewed physics paper, based on my personal experience, probably including my anecdote about solar cell R&D from here
Not academia but bhauth on the aesthetics of what makes a “good” VC pitch
maybe a couple other things (suggestions anyone?)
Homework problem for the reader: what are your “aesthetics of success”, and how are they screwing you over?
I followed this link thinking that it looks relevant to my question, but the way Holden delineates what academia is interested in, it should totally be interested in my ideas:
versus what Holden says are important questions that academia neglects:
The rest of your comment seems to be hinting that maybe academia is ignoring my ideas because it doesn’t like the aesthetics of my writing? (Not sure if that was your point, or if those bullet points weren’t supposed to be directly related to my question...) Even if that’s true though, I’m still puzzled why academia hasn’t reinvented any of my ideas (which have been independently invented multiple times outside of academia, e.g. Nick Szabo and Satoshi with b-money, Paul Christiano with UDASSA).
Hmm, yeah I guess what I wrote wasn’t too directly helpful for your question.
I think Holden forgot “trendy”. Trendy is very important. I think people in academia have a tacit shared understanding of the currently-trending topics / questions, within which there’s a contest to find interesting new ideas / progress. If an idea is important but not trendy, it’s liable to get neglected, I think. It’s kinda like in clothing fashion: if you find a brilliant use of beads, but beads aren’t fashion-forward this year, roughly nobody will care.
Of course, the trends change, and indeed everyone is trying to be the pioneer of the next hot topic. There are a lot of factors that go into “what is the next hot topic”, including catching the interest of a critical mass of respected people (or people-who-control-funding), which in turn involves them feeling it’s “exciting”, and that they themselves have an angle for making further progress in this area, etc. But trendiness doesn’t systematically track objective importance, and it’s nobody’s job to make it so.
At least, that’s what things felt like to me in the areas of physics I worked in (optics, materials science, and related). I’m much less familiar with philosophy, economics, etc.
Remember, aside from commercially-relevant ideas, success for academic research scientists (and philosophers) is 100% determined by “am I impressing my peers?”—grants, promotions, invited talks, etc. are all determined by that. So if I write a paper and the prestigious people in my field are unanimously saying “I don’t know about that thing, it’s not an area that I know or care about”, the result is just as bad for me and my career as if those people had unanimously said “this is lousy work”.
To be clear, when I said “the aesthetic of what makes a good X”, I meant it in a really broad sense. Maybe I should have said “the implicit criteria of what makes a good X” instead. So “the paper concerns a currently-trendy topic” can be part of that, even though it’s not really “aesthetics” in the sense of beauty. E.g., “the aesthetic of what makes a good peer-reviewed experimental condensed-matter physics paper” has sometimes been greatly helped by “it somehow involves nanotechnology”.
From my years in academia studying neuroscience and related aspects of bioengineering and medicine development… yeah. So much about how effort gets allocated is not ‘what would be good for our country’s population in expectation, or good for all humanity’. It’s mostly about ‘what would make an impressive-sounding research paper that could get into an esteemed journal?’, ‘what would be relatively cheap and easy to do, but sound disproportionately cool?’, ‘what do we guess that the granting agency we are applying to will like the sound of?’. So much emphasis on catching waves of trendiness, and so little on estimating the expected value of the results.
Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.
Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.
Aside from the literature on international relations, I don’t know much about academic dysfunction (what I do know comes mostly from reading parts of Inadequate Equilibria, particularly the visitor dialog), and other LessWrong people can probably cover it better. I think that planecrash, Yud’s second HPMOR-scale work, mentions that everyone in academia just generally avoids citing things published outside of academia, because they risk losing status if they do.
EDIT: I went and found that section, it is here:
It’s glowfic, so of course I don’t know how accurate it is, as it’s intended to be plausibly deniable enough to facilitate free writing (while keeping things entertaining enough to register as not-being-work).
I have to think more about the status dynamics that Eliezer talked about. There’s probably something to it… But this part stands out as wrong or at least needing nuance/explanation:
I think most academic venues do blind reviews and whoever decides whether or not to accept a paper isn’t supposed to know who wrote it? Which isn’t to say that the info won’t leak out anyway and influence the decision. (For example I once left out the acknowledgements section in a paper submission, thinking that, like the author byline, I was supposed to add it after the paper was accepted, but apparently I was actually supposed to include it and someone got really peeved that I didn’t.)
Also it seems weird that Eliezer wrote this in 2021, after this happened in 2019:
From talking with people who work on a lot of grant committees at the NIH and similar funding orgs, it’s really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea… So if you are familiar with the general approach of a particular lab and its primary investigator, you will immediately recognize and have a knee-jerk reaction (positive or negative) to a paper which pattern-matches to the work that that lab / subfield is doing.
Common reactions from grant reviewers:
Positive—“This fits in nicely with my friend Bob’s work. I respect his work, I should argue for funding this grant.”
Neutral—“This seems entirely novel to me, I don’t recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn’t argue too hard for it.”
Slightly negative—“This seems novel to me, and doesn’t sound particularly ‘jargon-y’ or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this.”
Negative—“This seems to pattern match to a subfield I feel biased against. Even if this isn’t from one of Jill’s students, it fits with Jill’s take on this subtopic. I don’t want views like Jill’s gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal.”
Ah, sorry that this wasn’t very helpful.
I will self-downvote so this isn’t the top comment. Yud’s stuff is neat, but I haven’t read much on the topic, and passing some along when it comes up has been a good general heuristic.
No need to be sorry, it’s actually great food for thought and I’m glad you pointed me to it.
I think that UDASSA and UDT might be in academia’s blind spots in the same way that the Everett interpretation is: more correct theories that came after less correct theories with mostly only theoretical evidence to support changing over to the new theories.
Many parts of academia have a strong Not Invented Here tendency. Not only is research outside of academia usually ignored, but so is research outside a specific academic citation bubble, even if another bubble investigates a pretty similar issue. For example, economic decision theorists ignore philosophical decision theorists, who in turn mostly ignore the economic decision theorists. Each has its own writing style, concerns, and canonical examples or texts, which makes it hard for outsiders to read the literature or even contribute to it, so they don’t.
A striking example is statistics, where various fields talk about the same mathematical thing with their own idiosyncratic names, unaware or unconcerned whether it already had a different name elsewhere.
Edit: Though LessWrong is also a citation bubble to some degree.
“Read the sequences....just the sequences”
Something a better, future version of rationalism could do is build bridges and facilitate communication between these little bubbles. The answer-to-everything approach has been tried too many times.
Have you tried talking to professors about these ideas?
Indeed, there is no need for sorrow, for by choosing to remain anonymous, you have done great things. The world owes you a Nobel Prize in Economics and a Turing Award. It is time for the world to seriously recognize your achievements and lead it towards a financial system without bubbles.
Why haven’t you written academic articles on these topics?
The secret is that an academic article is just a format, and anyone can submit to scientific journals. There is no need to have a PhD or even to work at a scientific institution.
I wrote an academic-style paper once, as part of my job as an intern in a corporate research department. It soured me on the whole endeavor, as I really didn’t enjoy the process (writing in the academic style, the submission process, someone insisting that I retract the submission to give them more credit despite my promise to insert the credit before publication), and then it was rejected with two anonymous comments indicating that both reviewers seemed to have totally failed to understand the paper, giving me no chance to try to communicate with them to understand what caused the difficulty. The cherry on top was my mentor/boss indicating that this is totally normal, and that I was supposed to just ignore the comments and keep resubmitting the paper to other venues until I ran out of venues.
My internship ended around that point and I decided to just post my ideas to mailing lists / discussion forums / my home page in the future.
Also, I think MIRI got FDT published in some academic philosophy journal, and AFAIK nothing came of it?
The FDT paper got 29 citations, but many are from MIRI-affiliated people and/or about AI safety. https://scholar.google.ru/scholar?cites=13330960403294254854&as_sdt=2005&sciodt=0,5&hl=ru
One can escape troubles with reviewers by publishing on arXiv or other paper archives (PhilPapers). Google Scholar treats them as normal articles.
But there are in fact good journals with reviewers who actually help (e.g. Futures).
I don’t think FDT got published—as far as I can tell it’s just on arXiv.
I was referring to Cheating Death In Damascus which talks about FDT in Section 4.
There is some similarity between UDASSA and “Law without Law” by Mueller, as both use Kolmogorov complexity to predict the distribution of observers. In LwL there isn’t any underlying reality except numbers, so it is just dust theory over random number fields.
From someone who left a corner of it: the blind spots could be arbitrarily large as far as I know, because there seemed to me to be no real explicit culture of Hamming questions / meta-level looking for anything neglected. You worked on something vaguely similar/related to your advisor’s work, because otherwise you can’t get connections to people who know how to attack the problem.
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT doesn’t know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)
Isn’t the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
These posts might be relevant:
A recipe for frontier model post-training
Futures of the data foundry business model
The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the “constitution”. In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it’s unclear if future models won’t soon do better than humans at labeling, or possibly already do better at some leading labs. Meta didn’t have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models, and o1-like reasoning models.
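To make the earlier point concrete—that RLAIF is essentially RLHF (or DPO) run on preference pairs whose labels come from a model rather than a human—here is a minimal sketch of a DPO-style loss on such pairs. It’s illustrative only; the function name and toy tensor values are placeholders, not any lab’s actual pipeline.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective: make the policy prefer the chosen response over the
    # rejected one, relative to a frozen reference model, with strength beta.
    chosen_margin = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy numbers standing in for summed token log-probs of two candidate responses.
# In an RLAIF-style setup, which response counts as "chosen" is decided by an AI judge
# following written instructions (a "constitution") rather than by a human rater; the
# loss itself is indifferent to who produced the label.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```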
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here’s a poem(?) I wrote for Twitter, actually before coming across the above story:
yyyep
Maybe others are using it in secret but don’t want to admit it for some reason? I can’t find any mention of Anthropic having filed a patent on the idea, but maybe other companies would feel too much like it looked like they were second-rate imitators if they said they were copying Anthropic’s idea?
Just speculating, I don’t know. Sure seems like a useful idea to copy.
AI companies don’t seem to be shy about copying RLHF though. Llama, Gemini, and Grok are all explicitly labeled as using RLHF.
I’m increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I’ve previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
Low expressed interest in metaphilosophy (in relation to either AI or humans)
Low expressed interest in AI philosophical competence (either concern that it might be low, or desire/excitement for supercompetent AI philosophers with Jupiter-sized brains)
Low concern that philosophical difficulty will be a blocker of AI alignment or cause of AI risk
High confidence when proposing novel solutions (even to controversial age-old questions, and when the proposed solution fails to convince many)
Rarely attacking one’s own ideas (in a serious or sustained way) or changing one’s mind based on others’ arguments
Rarely arguing for uncertainty/confusion (i.e., that that’s the appropriate epistemic status on a topic), with normative ethics being a sometime exception
Possible explanations:
General human overconfidence
People who have a high estimate of difficulty of philosophy self-selecting out of the profession.
Academic culture/norms—no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
“Signal group membership” may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
Here are some axes along which I think there’s some group membership signaling in philosophy (IDK about the extent and it’s hard to disentangle it from other stuff):
Math: platonism/intuitionism/computationalism (i.e. what is math?), interpretations of probability, foundations of math (set theory vs univalent foundations)
Mind: externalism/internalism (about whatever), consciousness (de-facto-dualisms (e.g. Chalmers) vs reductive realism vs illusionism), language of thought vs 4E cognition, determinism vs compatibilism vs voluntarism
Metaphysics/ontology: are chairs, minds, and galaxies real? (this is somewhat value-laden for many people)
Biology: gene’s-eye-view/modern synthesis vs extended evolutionary synthesis
I don’t think this is accurate. I think most philosophy is done under motivated reasoning, but it is not straightforwardly about signaling group membership.
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here: https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here: https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident, but that is more stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare: https://link.springer.com/article/10.1007/s43681-023-00379-1.
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they became convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning, or better intellectual tools in general? (Or what’s the closest to this that you’re aware of?)
Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know 1 or 2 private expressions of the former, but not translated into action yet.)
Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
How often have you seen philosophers say something like “Upon further reflection, my proposed solution to problem X has many problems/issues, I’m no longer confident it’s the right approach and now think X is much harder than I originally thought.”
Would also appreciate any links/citations/quotes (if personal but sharable communications) on these.
These are all things I’ve said or done due to high estimate of philosophical difficulty, but not (or rarely) seen among academic philosophers, at least from my casual observation from outside academia. It’s also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that for example you don’t think philosophers should often say or do these things), which would also be interesting to know.
Another academic philosopher, directed here by @Simon Goldstein. Hello Wei!
It’s not common to switch entirely to metaphilosophy, but I think lots of us get more interested in the foundations and methodology of at least our chosen subfields as we gain experience, see where progress is(n’t) being made, start noticing deep disagreements about the quality of different kinds of work, and so on. It seems fair to describe this as awakening to a need for better tools and a greater understanding of methods. I recently wrote a paper about the methodology of one of my research areas, philosophy of mathematical practice, for pretty much these reasons.
Current LLMs are pretty awful at discussing the recent philosophy literature, so I think anyone who’d like AI tools to serve as useful research assistants would be happy to see at least some improvement here! I’m personally also excited about the prospects of using language models with bigger context windows for better corpus analysis work in empirical and practice-oriented parts of philosophy.
I basically agree with Simon on this.
I don’t think this is uncommon. You might not see these reversals in print often, because nobody wants to publish and few people want to read a paper that just says “I retract my previous claims and no longer have a confident positive view to offer”. But my sense is that philosophers often give up on projects because the problems are piling up and they no longer see an appealing way forward. Sometimes this happens more publicly. Hilary Putnam, one of the most influential philosophers of the later 20th century, was famous for changing his mind about scientific realism and other basic metaphysical issues. Wesley Salmon gave up his influential “mark transmission” account of causal explanation due to counterexamples raised by Kitcher (as you can read here). It would be easy enough to find more examples.
Great questions. Sadly, I don’t have any really good answers for you.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I do not, except for the end of Superintelligence
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.
The FLI Pause letter didn’t achieve a pause, but it dramatically shifted the Overton Window.
Thanks, it’s actually very interesting and important information.
I’ve noticed (and stated in the OP) that normative ethics seems to be an exception where it’s common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because e.g. we’ve been trying to solve them for centuries without coming up with broadly convincing solutions), and there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn’t seem to be the case.
What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring this out, because if you mean “calling for an AI pause”, what is the opportunity cost (it seems easy enough to write or sign an open letter), and if you mean “the AI pause itself”, then “seems useless” contradicts “would love”. In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views of this on LW (or their preferred venue), and share your own views?
FTR I’d probably be up for helping out logistically with such an open letter (e.g. making the website and any other parts of it). I previously made this open letter.
I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy; metaethics is just the most salient to me. One example is the recent trend towards conceptual engineering (https://philpapers.org/browse/conceptual-engineering). Metametaphysics has also gotten a lot of attention in the last 10-20 years https://www.oxfordbibliographies.com/display/document/obo-9780195396577/obo-9780195396577-0217.xml. There is also some recent work in metaepistemology, but maybe less so because the debates tend to recapitulate previous work in metaethics https://plato.stanford.edu/entries/metaepistemology/.
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.
Thanks for this info and the references. I guess by “metaphilosophy” I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there’s a preferred term for the concept I’m pointing at.) My reasoning being that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?
I think you (and other philosophers) may be too certain that a pause won’t happen, but I’m not sure I can convince you (at least not easily). What about calling for it in a low cost way, e.g., instead of doing something high profile like an open letter (with perceived high opportunity costs), just write a blog post or even a tweet saying that you wish for an AI pause, because …? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if by keeping silent, you’re helping to keep society in a highly suboptimal equilibrium?
I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the “opportunity cost” issue, but I’m not sure if you’re open to that kind of argument. I think they should have some weight given moral uncertainty.
Hm. I think modern academic philosophy is a raging shitshow, but I thought philosophy on LW was quite good. I haven’t been a regular LW user until a couple of years ago, and the philosophical takes here, particularly Eliezer’s, converge with my own conclusions after a half lifetime of looking at philosophical questions through the lens of science, particularly neuroscience and psychology.
So: what do you see as the limitations in LW/Yudkowskian philosophy? Perhaps I’ve overlooked them.
I am currently skeptical that we need better philosophy for good AGI outcomes, vs. better practical work on technical AGI alignment (a category that barely exists) and PR work to put the likely personal intent aligned AGI into the hands of people that give half a crap about understanding or implementing ethics. Deciding on the long term future will be a matter of a long contemplation if we get AGI into good hands. We should decide if that logic is right, and if so, plan the victory party after we’ve won the war.
I did read your metaphilosophy post and remain unconvinced that there’s something big the rest of us are missing.
I’m happy to be corrected (I love becoming less wrong, and I’m aware of many of my biases that might prevent it):
Here’s how it currently looks to me: Ethics are ultimately a matter of preference, the rest is game theory and science (including the science of human preferences). Philosophical questions boil down to scientific questions in most cases, so epistemology is metaphilosophy for the most part.
Change my mind! Seriously, I’ll listen. It’s been years since I’ve thought about philosophy hard.
I was just reading Daniel Dennett’s memoir for no reason in particular, it had some interesting glimpses into how professional philosophers actually practice philosophy. Like I guess there’s a thing where one person reads their paper (word-for-word!) and then someone else is the designated criticizer? I forget the details. Extremely different from my experience in physics academia though!!
(Obviously, reading that memoir is probably not the most time-efficient way to learn about the day-to-day practice of academic philosophy.)
(Oh, there was another funny anecdote in the memoir where the American professional philosopher association basically had a consensus against some school of philosophy, and everyone was putting it behind them and moving on, but then there was a rebellion where the people who still liked that school of philosophy did a hostile takeover of the association’s leadership!)
A non-ethics example that jumps to my mind is David Chalmers on the Hard Problem of Consciousness here: “So if I’m giving my overall credences, I’m going to give, 10% to illusionism, 30% to panpsychism, 30% to dualism, and maybe the other 30% to, I don’t know what else could be true, but maybe there’s something else out there.” That’s the only example I can think of but I read very very little philosophy.
What are the issues that are “difficult” in philosophy, in your opinion? What makes them difficult?
I remember you and others talking about the need to “solve philosophy”, but I was never sure what was meant by that.
To whom does this not apply? Most people who “work on AI alignment” don’t even think that thinking is a thing.
@Nate Showell @P. @Tetraspace @Joseph Miller @Lorxus
I genuinely don’t know what you want elaboration of. Reacts are nice for what they are, but saying something out loud about what you want to hear more about / what’s confusing / what you did and didn’t understand/agree with, is more helpful.
Re/ “to whom not...”, I’m asking Wei: what groups of people would not be described by the list of 6 “underestimating the difficulty of philosophy” things? It seems to me that broadly, EAs and “AI alignment” people tend to favor somewhat too concrete touchpoints like “well, suppressing revolts in the past has gone like such and such, so we should try to do similar for AGI”. And broadly they don’t credit an abstract argument about why something won’t work, or would only work given substantial further philosophical insight.
Re/ “don’t think thinking …”, well, if I say “LLMs basically don’t think”, they’re like “sure it does, I can keep prompting it and it says more things, and I can even put that in a scaffold” or “what concrete behavior can you point to that it can’t do”. Like, bro, I’m saying it can’t think. That’s the tweet. What thinking is, isn’t clear, but That thinking is should be presumed, pending a forceful philosophical conceptual replacement!
That is, in fact, a helpful elaboration! When you said
my leading hypotheses for what you could mean were:
Using thought, as a tool, has not occured to most such people
Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.
Now, instead, my leading hypothesis is that you mean:
Most such people are failing to notice that there’s an important process, called “thinking”, which humans do but LLMs “basically” don’t do.
This is a bunch more precise! For one, it mentions AIs at all.
As my reacts hopefully implied, this is exactly the kind of clarification I needed—thanks!
Sure, but you’re not preaching to the choir at that point. So surely the next step in that particular dance is to stick a knife in the crack and twist?
That is -
And if neither you nor anyone else can do either of those things… maybe it’s time to give up and say that this ‘thinking’/‘reasoning’ thing is just philosophically confused? I don’t think that that’s where we’re headed, but I find it important to explicitly acknowledge the possibility; I don’t deal in more than one epiphenomenon at a time and I’m partial to Platonism already. So if this ‘reasoning’ thing isn’t meaningfully distinguishable in some observable way from what LLMs do, why shouldn’t I simply give in?
I’ve had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I’ve yet to get a positive answer (or find someone expressing this sentiment independently):
This aside, there are some people around LW/rationality who seem more cautious/modest/self-critical about proposing new philosophical solutions, like MIRI’s former Agent Foundations team, but perhaps partly as a result of that, they’re now out of a job!
Yeah that was not my reaction. (More like “that’s going to be the most beautiful thing ever” and “I want to be that too”.)
No, if anything the job loss resulted from not doing so much more, much more intently, and much sooner.
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done “so much more, much more intently, and much sooner”?
The type of fundamental problem that proper speculative philosophy is supposed to solve is the sort where streetlighting doesn’t work (or isn’t working, or isn’t working fast enough). But nearly all of the alignment field after like 2004 was still basically streetlighting. It was maybe a reasonable thing to have some hope in prospectively, but retrospectively it was too much investment in streetlighting, and retrospectively I can make arguments about why one should have maybe guessed that at the time. By 2018 IIRC, or certainly by 2019, I was vociferously arguing for that in AF team meetings—but the rest of the team either disagreed with me or didn’t understand me, and on my own I’m just not that good a thinker, and I didn’t find anyone else to try it with. I think they have good thoughts, but are nevertheless mostly streetlighting—i.e. not trying to take step after step of thinking at the level of speculative philosophy AND aimed at getting the understanding needed for alignment.
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn’t convince them (or someone else) to join you. To me this doesn’t clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)
If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to “see in the dark” to pick out a more promising search area or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?
(Also perfectly fine to stop here if you want. I’m pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)
If you say to someone
and then you listen to what they say, and reask the question and interrogate their answers, IME what it very often grounds out into is something like:
So I would summarize that position as endorsing streetlighting, in a very broad sense that encompasses most math / science / technology. And this position is largely correct! My claim is that
this is probably too slow for making Friendly AI, and
maybe one could go faster by trying to more directly cleave to the core philosophical problems.
I discuss the problem more here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
(But note that, while that essay frames things as “a proposed solution”, the solution is barely anything—more like a few guesses at pieces of methodology—and the main point is the discussion of the problem; maybe a writing mistake.)
An underemphasized point that I should maybe elaborate more on: a main claim is that there’s untapped guidance to be gotten from our partial understanding—at the philosophical level and for the philosophical level. In other words, our preliminary concepts and intuitions and propositions are, I think, already enough that there’s a lot of progress to be made by having them talk to each other, so to speak.
OK but what would this even look like?
Toss away anything amenable to testing and direct empirical analysis; it’s all too concrete and model-dependent.
Toss away mathsy proofsy approaches; they’re all too formalized and over-rigid and can only prove things from starting assumptions we haven’t got yet and maybe won’t think of in time.
Toss away basically all settled philosophy, too; if there were answers to be had there rather than a few passages which ask correct questions, the Vienna Circle would have solved alignment for us.
What’s left? And what causes it to hang together? And what causes it not to vanish up its own ungrounded self-reference?
From scratch but not from scratch. https://www.lesswrong.com/posts/noxHoo3XKkzPG6s7E/most-smart-and-skilled-people-are-outside-of-the-ea?commentId=DNvmP9BAR3eNPWGBa
https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
What makes you think there are any such ‘answers’, renderable in a form that you could identify?
And even if they do exist, why do you think a human being could fully grasp the explanation in finite time?
Edit: It seems quite possible that even the simplest such ‘answers’ could require many years of full time effort to understand, putting it beyond most if not all human memory capacity. i.e. By the end even those who ‘learned’ it will have forgotten many parts near the beginning.
(Upvoted since your questions seem reasonable and I’m not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI’s arguments.
We increase our own intelligence and philosophical competence until we can verify the ASI’s reasoning ourselves.
I blame science, math, engineering, entrepreneurship. Philosophy is the practice of the esoteric method, meaning it can’t be made truly legible for very long stretches of investigation. This results in accumulation of anti-epistemic hazards, which science doesn’t particularly need to have tools for dealing with, because it can filter its reasoning through frequent transitions into legibility.
Philosophy can’t rely on such filtering through legibility, it has to maintain sanity the hard way. But as philosophy enviously looks at the more successful endeavors of science, it doesn’t see respect for such methods of maintaining sanity in its reasoning, instead it sees that merely moving fast and breaking things works very well. And so the enthusiasm for their development wanes, instead philosophy remains content with the object level questions that investigate particular truths, rather than methods for getting better at telling which cognitive algorithms can more robustly arrive at truths (rationality, metaphilosophy).
About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.
Crossposting from X:
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it’s possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading to a collapse of civilization, but embryo selection for IQ could prevent that, and even if collapse happens, natural selection would start increasing fertility and intelligence of humans again, so future smarter humans should be able to rebuild civilization and restart technological progress.
Added: Here’s an example to illustrate my model. Assume a normally distributed population with average IQ of 100 and we need a certain number of people with IQ>130 to achieve AGI. If the total population was to half, then to get the same absolute number of IQ>130 people as today, average IQ would have to increase by 4.5, and if the population was to become 1⁄10 of the original, average IQ would have to increase by 18.75.
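For anyone who wants to check the arithmetic, here’s a minimal sketch, assuming IQ is normally distributed with the usual SD of 15 (the exact answers come out to roughly 4.6 and 18.8, close to the rounded figures above):

```python
from scipy.stats import norm

SD, THRESHOLD = 15, 130
baseline_fraction = norm.sf(THRESHOLD, loc=100, scale=SD)  # P(IQ > 130) when the mean is 100

for shrink in (2, 10):  # population halved, or reduced to 1/10
    needed_fraction = baseline_fraction * shrink  # same absolute number of IQ>130 people
    required_mean = THRESHOLD - norm.isf(needed_fraction) * SD
    print(f"Population 1/{shrink}: average IQ must rise by {required_mean - 100:.1f}")
# Prints roughly 4.6 and 18.8, matching the rounded figures in the example above.
```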
I find the idea interesting: To achieve a certain value of “total genius”, we either need a large population with a small fraction of geniuses, or a small population with a large fraction of geniuses.
(A third option is a small population with a small fraction of geniuses… and it takes a lot of time. The geniuses read each other’s books, rather than talk to each other directly. I think it was like this in the past. Very inefficient, because the information transfer by reading books is one-sided; does not allow collaboration in real time.)
I wonder how the heritability of IQ works, versus the reversion to the mean. Despite Pol Pot’s dystopian project, the average IQ in Cambodia seems to be… average. What would happen in a country where, let’s say, half of the children are produced by artificial insemination, and half of the sperm comes from fathers with IQ 130 and above? If the mother is average, the child’s expected IQ is roughly the average of 100 and 130, so 115. On one hand, nothing exceptional; on the other hand, if the baseline is now slightly higher, then the next generation… and here comes the question of how exactly reversion to the mean works, and whether constant injections of IQ-130 genes into the population could outrun it.
Reversion to the mean happens because the phenotype is an imperfect proxy for the genotype, so if you select a person with phenotypic IQ of 130, maybe their genotypic IQ is only 124, and therefore their offspring with an IQ 100 person would only be IQ 112 in expectation.
However, this expectation applies to both the offspring’s genotypic and phenotypic IQ, so this is the only regression to the mean you’re going to see; it’s not going to regress further down the line.
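Here’s a minimal sketch of that arithmetic via the breeder’s equation, with an assumed narrow-sense heritability of 0.8 (the 0.8 is my illustrative assumption, chosen to be consistent with a phenotypic 130 shrinking to a genotypic 124):

```python
# Breeder's equation: E[child] = population_mean + h2 * (midparent - population_mean).
# h2 = 0.8 is an assumed illustrative value, not a claim about the true heritability of IQ.
MEAN = 100
H2 = 0.8

def expected_child_iq(parent_a, parent_b, mean=MEAN, h2=H2):
    midparent = (parent_a + parent_b) / 2
    return mean + h2 * (midparent - mean)

print(expected_child_iq(130, 100))  # 112.0
# Equivalent view: shrink the IQ-130 parent toward the mean (100 + 0.8 * 30 = 124),
# then average with the IQ-100 parent: (124 + 100) / 2 = 112. The regression happens
# once, at this step; it does not compound further down the line.
```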
I wrote Smart Losers a long time ago, trying to understand/explain certain human phenomena. But the model could potentially be useful for understanding (certain aspects of) human-AI interactions as well.
Possibly relevant anecdote: Once I was with a group of people who tried various psychological experiments. That day, the organizers proposed that we play iterated Prisoner’s Dilemma. I was like “yay, I know the winning strategy, this will be so easy!”
I lost. Almost everyone always defected against me; there wasn’t much I could do to get points comparable to other people who mostly cooperated with each other.
After the game, I asked why. (During the game, we were not allowed to communicate, just to write our moves.) The typical answer was something like: “well, you are obviously very smart, so no matter what I do, you will certainly find a way to win against me, so my best option is to play it safe and always defect, to avoid the worst outcome”.
I am not even sure if I should be angry at them. I suppose that in real life, when you have about average intelligence, “don’t trust people visibly smarter than you” is probably a good strategy, on average, because there are just too many clever scammers walking around. At the same time I feel hurt, because I am a natural altruist and cooperator, so this feels extremely unfair, and a loss for both sides.
(There were other situations in my life where the same pattern probably also applied, but most of the time, you just don’t know why other people do whatever they do. This time I was told their reasoning explicitly.)
What (human or not) phenomena do you think are well explained by this model? I tried to think of some for 5 minutes, and the best I came up with was the strong egalitarianism among hunter-gatherers. I don’t actually know that much about hunter-gatherers though. In the modern world, something where “high IQ” people do worse is sex, but it doesn’t seem to fit your model.
Human-human: Various historical and current episodes of smarter-than-average populations being persecuted or discriminated against, such as intellectuals, “capitalists” (i.e., people labeled as such), certain ethnic groups. (I’m unsure my model is actually a good explanation of such phenomena, but this is mainly what I was trying to explain.)
Human-AI: Many people being reluctant to believe that it’s a good idea to build unaligned artificial superintelligences and then constrain them with a system of laws and/or social norms (which some people like Robin Hanson and Matthew Barnett have proposed). Aside from the issue of violent overthrow, any such system is bound to have loopholes, which the ASI will be more adept at exploiting, yet this adeptness potentially causes the ASI to be worse off (less likely to exist in the first place), similar to what happens in my model.
Are humans fundamentally good or evil? (By “evil” I mean something like “willing to inflict large amounts of harm/suffering on others in pursuit of one’s own interests/goals (in a way that can’t be plausibly justified as justice or the like)”, and by “good” I mean “most people won’t do that because they terminally care about others”.) People say “power corrupts”, but why isn’t “power reveals” equally or more true? Looking at some relevant history (people thinking Mao Zedong was sincerely idealistic in his youth, the early Chinese Communist Party looking genuine about wanting to learn democracy and freedom from the West, the subsequent massive abuses of power by Mao/CCP lasting to today), it’s hard to escape the conclusion that altruism is merely a mask that evolution made humans wear in a context-dependent way, to be discarded when opportune (e.g., when one has secured enough power that altruism is no longer very useful).
After writing the above, I was reminded of @Matthew Barnett’s AI alignment shouldn’t be conflated with AI moral achievement, which is perhaps the closest previous discussion around here. (Also related are my previous writings about “human safety” although they still used the “power corrupts” framing.) Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
At the time, I replied to him, “I think I’m less worried than you about ‘selfishness’ in particular and more worried about moral/philosophical/strategic errors in general.” I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough, that as they got richer and more powerful, their altruistic values would have more and more influence. In the few months since then, I’ve become more worried, perhaps due to learning more about Chinese history and politics...
The reason I said “most humans are not evil” is because I honestly don’t think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to extraordinarily immoral behavior, in the vicinity of purposefully inflicting harm on others as an end in itself, rather than out of indifference or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).
By contrast, I focused on consumer behavior because the majority of the world’s economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.
Focusing on consumer preferences is a natural thing to do if you want to capture intuitively “what humans are doing with their wealth”, at least in our current world. Before focusing on something else by default (such as moral preferences), I’d want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.
You mention one such argument along these lines:
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I’d like to see one before I abandon my focus on “everyday (e.g. consumer) behavior”.
Ok, I guess we just define/use it differently. I think most people we think of as “evil” probably justify inflicting harm on others as instrumental to some “greater good”, or do it to gain or maintain power, not because they value it for its own sake. I mean, if someone killed thousands of people in order to maintain their grip on power, I think we’d call them “evil” and not just “selfish”?
I’m pretty sure that billionaires consume much less as a percentage of their income than the average person does. EA funding comes disproportionately from billionaires, AFAIK. I personally spend a lot more time/effort on altruistic causes than I would if I were poorer. (Not donating much though, for a number of reasons.)
I’m thinking that we just haven’t reached that inflection point yet, where most people run out of things to spend selfishly on (like many billionaires have, and like I have to a lesser extent). As I mentioned in my reply to your post, a large part of my view comes from not being able to imagine what people would spend selfishly on, if each person “owned” something like a significant fraction of a solar system. Why couldn’t 99% of their selfish desires be met with <1% of their resources? If you had a plausible story you could tell about this, that would probably change my mind a lot. One thing I do worry about is status symbols / positional goods. I tend to view that as a separate issue from “selfish consumption” but maybe you don’t?
I like the insight regarding power corrupting or revealing. I think perhaps both might be true and, if so, we should keep both lines of thought in mind when thinking about these types of questions.
My general view is that most people are generally good when you’re talking about individual interactions. I’m less confident in that when one brings in in-group/out-group dynamics. I’m just not sure how to integrate all that into a general view or principle about human nature.
A line I heard in some cheesy B-grade horror movie relates to this question of human nature and the idea that we all have competing good and bad wolves inside. One of the characters asks which wolf is stronger, the good wolf or the bad one. The answer: “Which do you feed the most?”
My model is that the concept of “morality” is a fiction which has 4 generators that are real:
People have empathy, which means they intrinsically care about other people (and sufficiently person-like entities), but, mostly about those in their social vicinity. Also, different people have different strength of empathy, a minority might have virtually none.
Superrational cooperation is something that people understand intuitively to some degree. Obviously, a minority of people understand it on a System 2 level as well.
There is something virtue-ethics-like which I find in my own preferences, along the lines of “some things I would prefer not to do, not because of their consequences, but because I don’t want to be the kind of person who would do that”. However, I expect different people to differ in this regard.
The cultural standards of morality, which it might be selfishly beneficial to go along with, including lying to yourself that you’re doing it for non-selfish reasons; this, as you say, becomes irrelevant once you secure enough power. It’s a sort of self-deception that people are intuitively skilled at.
I don’t think altruism is evolutionarily connected to power as you describe. Caesar didn’t come to power by being better at altruism, but by being better at coordinating violence. For a more general example, the Greek and other myths don’t give many examples of compassion (though they give many other human values), it seems the modern form of compassion only appeared with Jesus, which is too recent for any evolutionary explanation.
So it’s possible that the little we got of altruism and other nice things are merely lucky memes. Not even a necessary adaptation, but more like a cultural peacock’s tail, which appeared randomly and might fix itself or not. While our fundamental nature remains that of other living creatures, who eat each other without caring much.
I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
Yeah, that seems to agree with my pessimistic view—that we are selfish animals, except we have culture, and some cultures accidentally contain altruism. So the answer to your question “are humans fundamentally good or evil?” is “humans are fundamentally evil, and only accidentally sometimes good”.
I think there is a version of altruistic pursuit where one will, by default, “reduce his power.” This happens when, in the process of attempting to do good, one becomes more exposed to unintended consequences. The person who sacrifices will reduce his ability to exercise power, but he may recover or more than make up for that loss if the tribe agrees with his rationale for the sacrifice.
Just because it was not among the organizing principles of any of the literate societies before Jesus does not mean it is not part of the human mental architecture.
Yes, obviously.
The vast majority of people would inflict huge amounts of disutility on others if they thought they could get away with it and benefitted from it.
What then prevents humans from being more terrible to each other? If the vast majority of people are like this, and they know that the vast majority of others are also like this (up to common knowledge), I don’t see how you’d get a stable society in which people aren’t usually screwing each other over to an enormous degree.
Any thoughts on why, if it’s obvious, it’s seldom brought up around here (meaning rationalist/EA/AI safety circles)?
There are several levels in which humans can be bad or evil:
Doing bad things because they believe them to be good
Doing bad things while not caring whether they are bad or not
Doing bad things because they believe them to be bad (Kant calls this “devilish”)
I guess when humans are bad, they usually do 1). Even Hitler may have genuinely thought he was doing the morally right thing.
Humans also sometimes do 2), for minor things. But rarely if the anticipated bad consequences are substantial. People who consistently act according to 2) are called psychopaths. They have no inherent empathy for other people. Most humans are not psychopathic.
Humans don’t do 3), they don’t act evil for the sake of it. They aren’t devils.
I’m sharing a story about the crew of the Enterprise from Star Trek TNG getting an intelligence/rationality upgrade, which causes them to deduce their fictional nature.[1]

[1] This was written with AI assistance; my workflow was to give the general theme to the AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and make small manual changes myself. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.
Nice story! Could you copy-paste here the prompt you used? I’m curious what bits of plot were added by you and which were improvised by the AI.
Thanks! I was unsure whether to post this, because I’m worried about contributing to a future where AI generated content overwhelms everything, and I wasn’t sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)
Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I’ll just describe what I did in more detail. I started with
which resulted in this initial outline:
I. Teaser
The Enterprise responds to a distress call from a Federation outpost
Upon arrival, they find the outpost abandoned, with only a mysterious device left behind
II. Act One
Data activates the device, which emits a pulse of energy
Crew members begin noticing small inconsistencies in their daily routines
Geordi questions why they don’t use replicators to solve resource shortages on colonies
III. Act Two
More crew members report strange observations
Dr. Crusher wonders why they don’t use transporter technology to cure diseases or injuries
Worf ponders the improbability of so many humanoid species evolving independently
Picard becomes concerned about the crew’s unusual behavior
IV. Act Three
Riker raises the question of why they don’t use transporters for backing up crew members
The senior staff meets to discuss the mounting inconsistencies
Data theorizes that the mysterious device has enhanced their critical thinking abilities
Picard orders the ship to return to the outpost for further investigation
V. Act Four
At the outpost, they discover records of experiments in “cognitive enhancement”
The crew starts questioning the nature of their reality
Some crew members experience existential crises
The ship’s computer begins to malfunction as the crew’s doubts intensify
VI. Act Five
Data determines that their enhanced cognition is causing a feedback loop in the ship’s systems
The crew must find a way to deactivate the device and return to normal
Philosophical debate ensues about whether they should retain their newfound awareness
Picard makes the difficult decision to deactivate the device
VII. Epilogue
The crew returns to normal, but with a lingering sense of unease
Picard records a captain’s log reflecting on the experience and the nature of their reality
Final scene shows a subtle glitch in the ship’s operations, leaving a sense of ambiguity
Then the main things I recall doing:
Asked the AI to add parts about the audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
Added a second part to the conversation between Guinan and the Ensign, with both of them having second thoughts about what Guinan initially said.
Added a part about suspending the ship’s computer to a sandbox, to make clear it wasn’t harmed.
Asked the AI for additional realizations/inferences the crew might have, and added selected ones to the story.
Asked the AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
Pointed out a bunch of sentences/paragraphs that didn’t make as much sense as they could, with the AI generally able to improve them a lot in one try.
Also tried to get the AI to find issues with the story on its own, but that didn’t work very well.
Manually made some edits, like changing a word here and there, deleting parts that were repetitive, and moving parts around to make the story flow better.
So, it wasn’t your idea for the crew to start theorizing that they were fictional characters? Nor was it your idea for them to theorize that they were fictional characters in a story written by AI?
Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:
And as part of its answer it suggested:
Then I wrote
Which made it focus more on the fiction hypothesis.
The part about the story being written by AI was its own idea. No hints from me as far as I can remember.
ETA: Here’s the whole answer to my “make it more philosophically and intellectually interesting” prompt:
This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:
Expand on the nature of reality:
Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
Explore the implications of living in a simulated reality and how it affects free will and personal identity.
Deepen the existential crisis:
Have characters grapple with the meaning of their existence if their reality is artificial.
Explore how this realization affects their relationships, duties, and sense of purpose.
Include scenes where characters contemplate whether their memories and experiences are “real.”
Introduce a moral dilemma:
If they discover their reality is simulated, should they attempt to contact their “creators”?
Explore the ethics of continuing to exist in a simulated world vs. attempting to “wake up” to a potentially harsher reality.
Have characters debate the morality of creating simulated universes themselves.
Expand on the scientific implications:
Have Data and Geordi theorize about the computational power required to simulate their universe.
Explore how the laws of physics in their universe might be “programmed” and if they can be altered.
Discuss the anthropic principle and fine-tuning of universal constants.
Incorporate elements of epistemology:
Have characters question how they can know anything for certain given their potentially simulated existence.
Explore concepts like Descartes’ “I think, therefore I am” in the context of artificial consciousness.
Discuss the nature of consciousness and whether it can be truly simulated.
Add a temporal element:
Introduce time dilation or time loops as the crew’s enhanced cognition allows them to perceive time differently.
Explore how their perception of time affects their understanding of causality and free will.
Enhance the ending:
Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
Consider a more ambiguous ending that leaves the question of their reality’s nature unresolved.
Include references to philosophical works:
Have characters discuss relevant philosophical concepts, such as Plato’s Allegory of the Cave, Berkeley’s idealism, or Putnam’s brain in a vat thought experiment.
Include a scene where the crew debates different philosophical perspectives on the nature of reality.
Explore the ethical implications of their technology:
Expand on Riker’s question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
Delve deeper into the ethics of using replicators and how it affects societal structures and values.
Incorporate a meta-narrative element:
Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
Explore the idea of narrative causality and how it might shape their reality.
These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.