But it’s trivial to produce a straight-line calibration graph: if yours isn’t straight, fix it at each probability p by repeatedly predicting, at p, the outcome of a coin weighted to land heads with probability p.
If you’re a prediction market platform where the probability has to be decided by dumb monkeys, just make sure that the vast majority of questions are of the form “will my p-weighted coin land heads”.
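As a toy illustration of the trick (a minimal sketch with made-up bucket sizes, not anything a real platform does): a forecaster with zero discernment who only ever predicts p-weighted coin flips at exactly p comes out perfectly calibrated.

```python
import random

# Toy sketch: a forecaster with zero discernment who only ever predicts
# p-weighted coin flips at exactly p ends up perfectly calibrated.
def gamed_calibration(buckets=(0.1, 0.2, 0.5, 0.8), flips_per_bucket=10_000):
    for p in buckets:
        hits = sum(random.random() < p for _ in range(flips_per_bucket))
        # Every forecast in this bucket was exactly p, so the observed
        # frequency converges to p and the calibration graph is a straight line.
        print(f"forecast {p:.2f} -> observed {hits / flips_per_bucket:.3f}")

gamed_calibration()
```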
---
If a calibration graph isn’t straight, that implies an epistemic free lunch—if things that you predict at 20% actually happen 30% of the time, just shift those predictions up to 30%. This is probably the reason why actual prediction markets are calibrated, since miscalibration leads to an easy trading strategy. But the presence of calibration is not a very interesting property.
Calibration is a super important signal of quality because it means you can actually act on the given probabilities! Even if someone is gaming calibration by betting given ratios on certain outcomes, you can still bet on their predictions and not lose money (often). That is far better than other news sources such as tweets or NYT or whatever. If a calibrated predictor and a random other source are both talking about the same thing, the fact that the predictor is calibrated is enough to make them the #1 source on that topic.
Disagree. It’s possible to get a good calibration chart in unimpressive ways, but that’s not how Polymarket & Manifold got their calibration, so their calibration is impressive.
To elaborate: It’s possible to get a good calibration graph by only predicting “easy” questions (e.g. the p-weighted coin), or by predicting questions that are gameable if you ignore discernment (e.g. 1⁄32 for each team to win the Super Bowl), or with an iterative goodharting strategy (e.g. seeing that too many of your “20%” forecasts have happened so then predicting “20%” for some very unlikely things). But forecasting platforms haven’t been using these kinds of tricks, and aren’t designed to. They came by their calibration the hard way, while predicting a diverse set of substantive questions one at a time & aiming for discernment as well as calibration. That’s an accomplishment.
When people are skeptical of the concept of AGI being meaningful or having clear boundaries, it can sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as a software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks anchoring a clear concept of AGI; otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
This is a late reply, but at least from this article, it seems like Ilya Sutskever was running out of confidence that OpenAI would reach AGI by mid 2023. Additionally, if the rumors about GPT-5 are true, it’s mainly going to be a unification of existing models rather than something entirely new. Combined with the GPT-4.5 release, it sure seems like progress at OpenAI is slowing down rather than speeding up.
How do you know that researchers at AGI labs genuinely believe what they’re saying? Couldn’t the companies just put pressure on them to act like they believe Transformative AI is imminent? I just don’t buy that these agencies are dismissive without good reason. They’ve explored remote viewing and other ideas that are almost certainly bullshit. If they are willing to consider those possibilities, I don’t know why they wouldn’t consider the possibility of current deep learning techniques creating a national security threat. That seems like their job, and they’ve explored significantly weirder ideas.
I just don’t buy that these agencies are dismissive without good reason
On what possible publicly-unavailable evidence could they have updated in order to correctly attain such a high degree of dismissiveness?
I can think of three types of evidence:
Strong theoretical reasons.
E. g., some sort of classified, highly advanced, highly empirically supported theory of deep learning/intelligence/agency, such that you can run a bunch of precise experiments, or do a bunch of math derivations, and definitively conclude that DL/LLMs don’t scale to AGI.
Empirical tests.
E. g., perhaps the deep state secretly has 100x the compute of AGI labs, and they already ran the pretraining game to GPT-6 and were disappointed by the results.
Overriding expert opinions.
E. g., a large number of world-class best-of-the-best AI scientists with an impeccable track record firmly and unanimously saying that LLMs don’t scale to AGI. This requires either a “shadow industry” of AI experts working for the government, or for the AI-expert public speakers to be on the deep state’s payroll and lying in public about their uncertainty.
I mean, I guess it’s possible that what we see of the AI industry is just the tip of the iceberg and the government has classified research projects that are a decade ahead of the public state of knowledge. But I find this rather unlikely.
And unless we do postulate that, I don’t see any possible valid pathway by which they could’ve attained high certainty regarding the current paradigm not working out.
They’ve explored remote viewing and other ideas that are almost certainly bullshit
There are two ways we can update on it:
The fact that they investigated psychic phenomena means they’re willing to explore a wide variety of ambitious ideas, regardless of their weirdness – and therefore we should expect them not to dismiss the AGI Risk out of hand.
The fact that they investigated psychic phenomena means they have a pretty bad grip on reality – and therefore we should not expect them to get the AGI Risk right.
I never looked into it enough to know which interpretation is the correct one. Expecting less competence rather than more is usually a good rule of thumb, though.
it sure seems like progress at OpenAI is slowing down rather than speeding up
To be clear, I personally very much agree with that. But:
at least from this article, it seems like Ilya Sutskever was running out of confidence that OpenAI would reach AGI by mid 2023
I find that I’m not inclined to take Sutskever’s current claims about this at face value. He’s raising money for his thing; he has a vested interest in pushing the agenda that the LLM paradigm is a dead end and that his way is the only way. Same as how it became advantageous for him to talk about the data wall once he was no longer with the unlimited-compute company.
Again, I do believe both in LLMs being a dead end and in the data wall. But I don’t trust Sutskever to be a clean source of information regarding that, so I’m not inclined to update on his claims to that end.
Those are good points. The last thing I’ll say drastically reduces the amount of competence required for the government to be dismissive while still being rational: the leading AI labs may already be fairly confident that the current techniques of deep learning won’t get to AGI in the near future, so the security agencies know this as well.
That would make sense. But I doubt all AGI companies are that good at informational security and deception. This would require all of {OpenAI, Anthropic, DeepMind, Meta, xAI} to decide on the deceptive narrative, and then not fail to keep up the charade, which would require both sending the right public messages and synchronizing their research publications such that the set of paradigm-damning ones isn’t public.
In addition, how do we explain people who quit AGI companies and remain with short timelines?
I guess I would respond to the first point by saying all of the companies you mentioned have incentive to say they are closing in on AGI even if they aren’t. It doesn’t seem that sophisticated to say “we’re close to AGI” when you’re not. Mark Zuckerberg said that AI would be at the level of a junior SWE this year, and Meta proceeded to release Llama 4. Unless prognosticators at Meta seriously fucked up, the most likely scenario is that Zuckerberg made that comment knowing it was bullshit. And the sharing of research did slow down a lot in 2023, which gave companies cover to not release unflattering results.
And to your last point, it seems reasonable that companies could pressure former employees to act as if they believe AGI is imminent. And some researchers may be emotionally invested in believing that what they worked on is what will lead to superintelligence.
And my question for you is: if DeepMind had solid evidence that AGI would be here in 1 year, and if the security agencies had access to DeepMind’s evidence and reasoning, do you believe they would still do nothing?
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
I think the first question to think about is how to use them to make CDT decisions. You can create a market about a causal effect if you have control over the decision and you can randomise it to break any correlations with the rest of the world, assuming the fact that you’re going to randomise it doesn’t otherwise affect the outcome (or bettors don’t think it will).
Committing to doing that does render the market useless for choosing policy, but you could randomly decide whether to randomise or to make the decision via whatever process you actually want to use, and have the market be conditional on the former. You probably don’t want to be randomising your policy decisions too often, but if liquidity weren’t an issue you could set the probability of randomisation arbitrarily low.
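A minimal sketch of that mechanism (hypothetical names, epsilon chosen arbitrarily): the market is conditional on the randomisation branch, so bets only resolve when the decision really was a coin flip, which is what breaks the correlation with the rest of the world.

```python
import random

# Sketch of a decision market conditional on randomisation: with small
# probability epsilon the decision is made by coin flip and the market
# resolves; otherwise the usual decision process runs and the market
# resolves N/A (bets refunded). Conditional on resolution, the action is
# independent of everything else, so prices estimate the causal effect.
def run_decision(usual_decision_process, epsilon=0.01):
    if random.random() < epsilon:
        action = random.choice(["A", "B"])   # randomised branch
        market_resolves = True
    else:
        action = usual_decision_process()    # whatever process you actually want
        market_resolves = False              # conditional market voids
    return action, market_resolves

# e.g. run_decision(lambda: "A")
```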
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t prevent you from being pretty smart about deciding what exactly to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
On YouTube, @Evbo’s parkour civilization and PVP civilization drama movies (professionally produced, set in Minecraft, and half-parodies of YA dystopia) serve as a surprisingly good demonstration of Instrumental Convergence (the protagonist kills or bribes most people they meet to “rank up” in the beginning) and of non-human morality (the characters basically only care about the Minecraft activity of their series, without a hint of irony).
I think using existing non-AI media as an analogy for AI could be helpful, because people think that a terminator-like ASI would be robots shooting people, which is one of the reasons why a common suggestion for dealing with unaligned AI is to just turn it off, pour water on the servers, etc.
I’ve been thinking through the following philosophical argument for the past several months.
1. Most things that currently exist have properties that allow them to continue to exist for a significant amount of time and propagate, since otherwise, they would cease existing very quickly.
2. This implies that most things capable of gaining adaptations, such as humans, animals, species, ideas, and communities, have adaptations for continuing to exist.
3. This also includes decision-making systems and moral philosophies.
4. Therefore, one could model the morality of such things as tending towards the ideal of perfectly maintaining their own existence and propagating as much as possible.
Many of the consequences of this approximation of the morality of things seem quite interesting. For instance, the higher-order considerations of following an “ideal” moral system (that is, utilitarianism using a measure of one’s own continued existence at a point in the future) lead to many of the same moral principles that humans actually have (e.g. cooperation, valuing truth) while also avoiding a lot of the traps of other systems (e.g. hedonism). This chain of thought has led me to believe that existence itself could be a principal component of real-life morality.
While it does have a lot of very interesting conclusions, I’m very concerned that if I were to write about it, I would receive 5 comments directing me to some passage by a respected figure that already discusses the argument, especially given the seemingly incredibly obvious structure it has. However, I’ve searched through LW and tried to research the literature as well as I can (through Google Scholar, Elicit, and Gemini, for instance), but I must not have the right keywords, since I’ve come up fairly empty, other than for philosophers with vaguely similar sounding arguments that don’t actually get at the heart of the matter (e.g. Peter Singer’s work comes up a few times, but he particularly focused on suffering rather than existence itself, and certainly didn’t use any evolutionary-style arguments to reach that conclusion).
If this really hasn’t been written about extensively anywhere, I would update towards believing the hypothesis that there’s actually some fairly obvious flaw that renders it unsound, stopping it from getting past, say, the LW moderation process or the peer review process. As such, I suspect that there is some issue with it, but I’ve not really been able to pinpoint what exactly stops someone from using existence as the fundamental basis of moral reasoning.
Would anyone happen to know of links that do directly explore this topic? (Or, alternatively, does anyone have critiques of this view that would spare me the time of writing more about this if this isn’t true?)
more difficult questions get allocated more reasoning tokens.
higher reasoning “level” doesn’t
https://arxiv.org/html/2503.04697 There is an optimal amount of reasoning to do at inference time, and doing more degrades performance (“overthinking”). You can design models to estimate this for themselves.
https://www.paulgraham.com/safe.html This is an analytical argument by Paul Graham: you can be “nice” (not capture the most value, in zero-sum contexts) and it won’t affect your revenue/profits much at all compared to your growth rate.
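Rough arithmetic sketch of the point (my numbers, not Graham’s): the value you give up by being nice is a one-time constant factor, while the growth rate compounds, so after a few periods the growth rate dominates.

```python
# Toy numbers (not from the essay): forgoing 10% of capturable value is a
# constant factor, while growth compounds every period.
growth = 2.0         # doubling each period
niceness_cost = 0.9  # capture only 90% of the value you could extract
periods = 5

aggressive = growth ** periods            # 32.0
nice = niceness_cost * growth ** periods  # 28.8
print(aggressive, nice)  # the gap is small next to one extra doubling
```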
Periodic reminder: AFAIK (though I didn’t look much) no one has thoroughly investigated whether there’s some small set of molecules, delivered to the brain easily enough, that would have some major regulatory effects resulting in greatly increased cognitive ability. (Feel free to prove me wrong with an example of someone plausibly doing so, i.e. looking hard enough and thinking hard enough that if such a thing was feasible to find and do, then they’d probably have found it—but “surely, surely, surely someone has done so because obviously, right?” is certainly not an accepted proof. And don’t call me Shirley!)
Since 1999 there have been “Doogie” mice, genetically engineered to overexpress NR2B in their brains; they were found to have significantly greater cognitive function than their normal counterparts, even performing twice as well on one learning test. No drug AFAIK has been developed that selectively (and safely) enhances NR2B function in the brain, which would best be achieved by a positive allosteric modulator of NR2B. But also, no drug company has wanted to or tried to specifically increase general intelligence/IQ in people, and increasing IQ in healthy people is not recognized as treating a disease or even publicly supported. The drug SAGE718 comes close, but it is a pan-NMDA allosteric (which still showed impressive increases in cognitive endpoints in its trial).

Theoretically, if we try to understand how general intelligence/IQ works in a pharmacological sense, then we should be able to develop drugs that affect IQ. Two ways to do that are investigating the neurological differences between individuals with high IQ and those with average IQ, and mapping out the function of brain regions implicated in IQ, e.g. the dorsolateral prefrontal cortex (dlPFC). If part of the difference between high and average IQs is neurotransmitter-based and could be emulated with small molecules, then such drugs could be developed. Genomic studies already link common variation in postsynaptic NMDA-complex genes and in nicotinic receptor genes (e.g., CHRNA4) to small differences in cognitive test scores across populations.

Likewise, key brain regions like the dlPFC could be positively modulated. For example, we know persistent-firing delay cells in the macaque dlPFC rely on slow NMDA-receptor-mediated recurrent excitation, and their activity is mainly gated by acetylcholine acting at both α7 nicotinic and M1 muscarinic receptors. So positively tuning delay-cell firing with α7 and M1 ligands augments your dlPFC. Indeed, electrophysiological and behavioral experiments show that low-dose stimulation or positive allosteric modulation (PAM) of either receptor subtype enhances delay-period firing and working-memory performance, whereas blockade or excessive stimulation impairs them.
There are in fact drugs with sound mechanisms, as described above, that support significant cognitive enhancement: some very recently developed and currently in trials, and some that have already shown very impressive cognitive improvement in animals and humans in past trials despite being developed not for cognitive enhancement but rather for diseases like Alzheimer’s or conditions like depression. Examples include TAK653 (AMPA PAM), ACD856 (TrkB PAM), Tropisetron (α7 partial agonist), Neboglamine (NMDA Glycine PAM), BPN14770 (PDE4D NAM), SAGE718 (NMDA PAM), TAK071 and AF710B (M1 positive allosterics).
There is a small community of nootropics enthusiasts (r/Nootopics and its discord) who have tried and tested some of these compounds and reported significant cognitive enhancement, with TAK653 increasing IQ by as much as 7 points on relatively decent online IQ tests (e.g. mensa.no, mensa.dk) and also on professionally administered tests (that weren’t taken twice, to minimize retake effects), with cognitive benchmarks like humanbenchmark.com and the WAIS Digit Span subtest likewise showing improvements.

The rationale behind TAK653 (also called “Osavampator”) and AMPA PAMs is that positive allosteric modulators of AMPA-type glutamate receptors such as TAK653 boost the size and duration of fast excitatory postsynaptic currents without directly opening the channel. That extra depolarization recruits NMDA receptors, calcium influx, and a rapid BDNF-mTOR signaling cascade that produces spine growth and long-term potentiation. In rodents, low-nanomolar brain levels of TAK-653 have been shown to rescue or enhance recognition memory, spatial working memory and attentional accuracy; in a double-blind cross-over Phase 1 study the same compound sped psychomotor responding and Stroop task performance in healthy volunteers.
Some great, well-written and cited write-ups I would encourage you to read if you have the time:
I did a high-level exploration of the field a few years ago. It was rushed and optimized more for getting it out there than rigor and comprehensiveness, but hopefully still a decent starting point.
I personally think you’d wanna first look at the dozens of molecules known to improve one or another aspect of cognition in diseases (e.g. Alzheimer’s and schizophrenia), that were never investigated for mind enhancement in healthy adults.
Given that some of these show very promising effects (and are often literally approved for cognitive enhancement in diseased populations), given that many of the best molecules we have right now were initially also just approved for some pathology (e.g. methylphenidate, amphetamine, modafinil), and given that there is no incentive for the pharmaceutical industry to conduct clinical trials on healthy people (FDA etc. do not recognize healthy enhancement as a valid indication), there seems to even be a sort of overhang of promising molecule candidates that were just never rigorously tested for healthy adult cognitive enhancement.
Thanks. Seems worth looking into more. I googled the first few on your list, and they’re all described as working via some neurotransmitter / receptor type, either agonism / antagonism / reuptake inhibition. Not everything on the list is like that (I recognize ginkgo biloba as being related to blood flow). But I don’t think these sorts of things would stack at all, or even necessarily help much for someone who isn’t sick / doesn’t have some big imbalance or whatever.
My hope for something like this existing is a bit more specific. It comes from thinking that there should be small levers with large effects, because natural development probably pulls some such levers which activate specific gene regulatory networks at different points—e.g. first we pull the [baby-type brain lever], then the [5 year old brain lever], etc.
AFAIK pharmaceutical research is kind of at an impasse because virtually all the small molecules that are easily delivered and have any chance to do anything have been tested and are either found useful or not. New pharmaceuticals need to explore more complex chemical spaces, like artificial antibodies. So I think if there was anything simple that has this effect (the way, say, caffeine makes you wake up) we would know.
Fair, though generally I conflated them because if your molecules aren’t small, due to sheer combinatorics the set of possible candidates becomes exponentially massive. And then the question is “ok but where are we supposed to look, and by which criterion?”.
Thanks. One of the first places I’d look would be hormones, which IIUC don’t count as small molecules? Though maybe natural hormones have already been tried? But I’d wonder about more obscure or risky ones, e.g. ones normally only active in children.
When you and another person have different concepts of what’s good.
When both of you have the same concepts of what’s good but different models of how to get there.
This happens a lot when people are perfectionist and have aesthetic preferences for work being done in a certain way.
This happens in companies a lot. AI will work in those contexts and will be deceptive if it wants to do useful work. Actually, maybe not; the dynamics will be different, like the AI being neutral in some way, such that anybody can turn on honesty mode and ask it anything.
Anyway, I think that because of the way companies are structured and how humans work, being slightly deceptive allows you to do useful work (I think it’s pretty intuitive for anyone who has worked in a corporation or watched The Office).
I don’t get the downvotes. I do think it’s extremely simple—look at politics in general or even workplace politics; just try to google it, there are even Wikipedia pages roughly about what I want to talk about. I have experienced, many times, a situation where I need to do my job and my boss makes it harder for me in some way—being not completely honest is an obvious strategy, and it’s good for the company you are working at.
I think the downvotes are because the correct statement is something more like “In some situations, you can do more useful work by being deceptive.” I think this is actually what you argue for, but it’s very different from “To do useful work you need to be deceptive.”
If “To do useful work you need to be deceptive.” this means that one can’t do useful work without being deceptive. This is clearly wrong.
The more AI companies suppress AI via censorship, the bigger the black market for completely uncensored models will be. Their success is therefore digging our own grave. In other words, mundane alignment has a net negative effect.
The confusion (in popular press, not so much among professionals or here) between censorship and alignment is a big problem. Censorship and hamfisted late-stage RL is counterproductive to alignment, both for the reason you give (increases demand for grey-market tools) and because it makes serious misalignment much less easy to notice.
Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
It’s effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra fast CUDA kernels, but from the AI’s perspective, this is sort of like it has a weak AI tool attached to it (in a well integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task) is effectively weak and can’t draw in a deep way on the AI’s overall skills or knowledge to generalize (likely it can shallowly draw on these in a way which is similar to the overall AI providing input to the weak tool AI). Further, you could actually break out these capabilities into a separate weak model that humans can use. Humans would use this somewhat less fluently as they can’t use it as quickly and smoothly due to being unable to instantaneously translate their thoughts and not being absurdly practiced at using the tool (like AIs would be), but the difference is ultimately mostly convenience and practice.
Integrated superhumanness: The AI is superhuman at things like writing ultra fast CUDA kernels via a mix of applying relatively general (and actually smart) abilities, having internalized a bunch of clever cognitive strategies which are applicable to CUDA kernels and sometimes to other domains, as well as domain specific knowledge and heuristics. (Similar to how humans learn.) The AI can access and flexibly apply all of the things it learned from being superhuman at CUDA kernels (or whatever skill) and with a tiny amount of training/practice it can basically transfer all these things to some other domain even if the domain is very different. The AI is at least as good at understanding and flexibly applying what it has learned as humans would be if they learned the (superhuman) skill to the same extent (and perhaps the AIs are actually much better at this than humans). You can’t separate these capabilities into a weak model, the weak model RL’d on this (and distilled into) would either be much worse at CUDA or would need to actually be generally quite capable (rather than weak).
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they’ve been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing skills they learn (at least without spending a bunch of extra compute on skill integration) because this ability currently seems to be lagging behind other capabilities relative to humans and AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, fast speed, parallel training on vast amounts of relevant data including “train once, deploy many”, better memory, faster and better communication, etc).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I suppose that most tasks that an LLM can accomplish could theoretically be performed more efficiently by a dedicated program optimized for that task (and even better by a dedicated physical circuit). Hypothesis 1) amounts to considering that such a program, a dedicated module within the model, is established during training. This module can be seen as a weak AI used as a tool by the stronger AI. A bit like how the human brain has specialized modules that we (the higher conscious module) use unconsciously (e.g., when we read, the decoding of letters is executed unconsciously by a specialized module).
We can envision that at a certain stage the model becomes so competent in programming that it will tend to program code on the fly, a tool, to solve most tasks that we might submit to it. In fact, I notice that this is already increasingly the case when I ask a question to a recent model like Claude Sonnet 3.7. It often generates code, a tool, to try to answer me rather than trying to answer the question ‘itself.’ It clearly realizes that dedicated code will be more effective than its own neural network. This is interesting because in this scenario, the dedicated module is not generated during training but on the fly during normal production operation. In this way, it would be sufficient for AI to become a superhuman programmer to become superhuman in many domains thanks to the use of these tool-programs. The next stage would be the on-the-fly production of dedicated physical circuits (FPGA, ASIC, or alien technology), but that’s another story.
This refers to the philosophical debate about where intelligence resides: in the tool or in the one who created it? In the program or in the programmer? If a human programmer programs a superhuman AI, should we attribute this superhuman intelligence to the programmer? Same question if the programmer is itself an AI? It’s the kind of chicken and egg debate where the answer depends on how we divide the continuity of reality into discrete categories. You’re right that integration is an interesting criterion as it is a kind of formal / non arbitrary solution to this problem of defining discrete categories among the continuity of reality.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some / a lot of the beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, i.e., in Yudkowsky / Ngo dialogues. (Object level: Both the LW-central and alternatives to the LW-central hypotheses seem insufficiently articulated; they operate as a background hypothesis too large to see rather than something explicitly noted, imo.)
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer.
Makes me wonder whether most of what people believe to be “domain transfer” could simply be IQ.
I mean, suppose that you observe a person being great at X, then you make them study Y for a while, and it turns out that they are better at Y than an average person who spend the same time studying Y.
One observer says: “Clearly some of the skills at X have transferred to the skills of Y.”
Another observer says: “You just indirectly chose a smart person (by filtering for high skills at X), duh.”
This seems important to think about, I strong upvoted!
As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2).
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it’s learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it’s using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I’m skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn’t trained on), but this paper was the most clearly relevant link I could find.
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3′s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
We also know that o3 was trained on enormous amounts of RL tasks, some of which have “verified rewards.” The folks at OpenAI are almost certainly cramming every bit of information and every conceivable task into their o-series of models! A heuristic here is that if there’s an easy-to-verify answer and you can think of it, o3 was probably trained on it.
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
On the other hand, the DeepGuessr benchmark finds that base models like GPT-4o and GPT-4.1 are almost as good as reasoning models at this, and I would expect these to have less post-training, probably not enough to include GeoGuessr
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
I don’t think this explains our disagreements. My low-confidence guess is we have reasonably similar views on this. But I do think it drives parts of some disagreements between me and people who are much more optimistic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Is the crux that the more optimistic folks plausibly agree (2) is cause for concern, but believe that mundane utility can be reaped with (1), and they don’t expect us to slide from (1) into (2) without noticing?
(I don’t know how to better organize my thoughts and discoveries, and I also suspect that it would be better to wait until I master speedreading, but I think it may be worth just sharing/asking about this one big confusion of mine right now as a quick take.)
When I was younger I considered it obvious how the human mind works: it had components like imagination, memory, etc. And of course, thoughts were words. How could it be at all possible to think not in words?
Discovery by mere observation
But some time ago I read “Surely You’re Joking, Mr. Feynman!” for the third time, and it finally succeeded in making me go and just look at reality. What I found… is that all my life was a lie. Facts aren’t things produced in labs via some elaborate tools, passed down through a hierarchy of authority, and whispered to you by Teachers in school or by Experts on YouTube.
Facts are just things that emerge when someone observes reality or deduces from information. I mean, of course I knew that you can just see things in everyday life. But those were mere mundane things, things from another category that couldn’t be compared with True Scientific Knowledge.
In the end, you can’t just introspect a little and go publish scientific papers about how human vision or thinking works, can you? You can’t discover true knowledge by doing some easy, unserious thing, accessible to everyone, like… observation. Or thinking.
No, really, truths are things told to you by authority, not things you can just… see. In school you are told arcane truths that came from experts; you are not just told to… observe a little.
And I have seen people who dared to trust their eyes more than the words of scientists; they believed in dreams prophesying the future and all other sorts of crazy things. It would be a terrible fate to become one of those.
But now I was just looking at reality and discovering information that could be on a Wikipedia page about how the eye or the mind works. Just by observation. Without any eye-motion detectors or other tools. And it was really clear that this knowledge isn’t in any qualitative way worse than what comes from tools in labs or from authoritative sources.
Eventually, my brain just got the non-standard input of direct observation, and I found out that I am the type of mind that believes its own eyes more than authoritative people. Though, well, it was actually new information; before, I genuinely didn’t know that with just my own eyes and brain I could discover the kinds of things that are usually told to you by authoritative people.
Speed of thinking
And the main thing I started to observe was my mind. And I found out that I had been wrong all this time about how it works and what its constraints are. It was like being a 12-dimensional creature that had only ever moved in 3, and then finding all these degrees of freedom and starting to unfold.
And there were many, many more things than I am able to cover in one post, even briefly. But there is one thing that confuses me the most: I found out that I can think not only in words. That was a strange feeling, more like those dreams where you have some incredible mental experience and then wake up and find that it was complete nonsense.
What I was, apparently, observing was that I can think not in words but in some strange, indescribable way without them. And even more, it felt multiple times faster than thinking in words. And I had the wild idea of “speedthinking”, analogous to speedreading: if you are able to process others’ ideas at 5x speed just by using the visual modality instead of the auditory one, then maybe it is also possible to process your own ideas that fast?
And the problem was that it was too wild. If it were possible, then… people who think 5 times faster would be like 1000-year-old vampires: they would shine, they would be super fast and noticeable, at 20 they would have more cumulative knowledge than usual people at 100, and so on, for the rest of their lives.
It would be super noticeable. And then everyone would go and also learn to think visually. That wasn’t our world. Was it possible that I had discovered a thing that was never, or almost never, discovered by anybody at all?
But then, when I was practicing a foreign language, I found that when I am trying to find a word, I can think without words. It wasn’t just a daydream. Conceptual thinking was real. And so I decided to just test whether it was possible to learn it as a skill and think 5 times faster.
Unfortunately, it took much, much longer to test than I thought. But now I am sure that conceptual thinking is possible; I can just do it. And I found that there are many more properties of a thinking mode than just that one.
But I am still very confused by the question of what thinking is usually like. For some strange reason, people usually can’t report detailed introspection about how they think. I also tried asking Grok about that, but the answers also failed to form a picture.
On the one hand, it looks like people usually think much faster than I was thinking in words. But on the other hand, the average reading speed is 200 wpm, and people usually read by subvocalization. Which is really strange if they think not in words, or in words but by hearing them rather than pronouncing them, and thus much faster.
But what about unusual people? Here Eliezer Yudkowsky immediately comes to mind, who in his glowfics gave probably the most detailed (or maybe just the most object-level instead of metaphorical) descriptions of the workings of the mind that I have seen.
And he definitely talks as if thoughts work as sequential mental auditory words, with their length being a cap. And he also talks about speedreading as if the actual restriction is the speed of your mind, not the speed of your body.
And I now have some understanding of how the mind can work after some training: conceptual and visual thinking dozens of times faster than auditory thinking-by-pronouncing is certainly possible. But I am still really confused about how thinking usually works.
And well, also about its significance. Is it actually such a big advantage if you can think 30 times faster than somebody else, as I thought initially?
Is it actually such a big advantage if you can think 30 times faster than somebody else, as I thought initially?
Depends. I would expect that the difference is not only in speed, but also in precision, or maybe suitability for different kinds of problems.
It seems to me that for me “inner dialogue” works better for figuring out complicated stuff or coming to unusual conclusions, “visualization” works for geometrical and mechanical problems, “feeling” for choosing between clearly given options, and there is also something in between that I use for medium-difficulty situations.
I claim it is a lot more reasonable to use the reference class of “people claiming the end of the world” than “more powerful intelligences emerging and competing with less intelligent beings” when thinking about AI x-risk. further, we should not try to convince people to adopt the latter reference class—this sets off alarm bells, and rightly so (as I will argue in short order) - but rather to bite the bullet, start from the former reference class, and provide arguments and evidence for why this case is different from all the other cases.
this raises the question: how should you pick which reference class to use, in general? how do you prevent reference class tennis, where you argue back and forth about what is the right reference class to use? I claim the solution is you want to use reference classes that have consistently made good decisions irl. the point of reference classes is to provide a heuristic to quickly apply judgement to large swathes of situations that you don’t have time to carefully examine. this is important because otherwise it’s easy to get tied up by bad actors who avoid being refuted by making their beliefs very complex and therefore hard to argue against.
the big problem with the latter reference class is it’s not like anyone has had many experiences using it to make decisions ex ante, and if you squint really hard to find day to day examples, they don’t all work out the same way. smarter humans do mostly tend to win over less smart humans. but if you work at a zoo, you will almost always be more worried about physical strength and aggressiveness when putting different species in the same enclosure. if you run a farm (or live in Australia), you’re very worried about relatively dumb invasive animals like locusts and rabbits.
on the other hand, everyone has personally experienced a dozen different doomsday predictions. whether that’s your local church or faraway cult warning about Armageddon, or Y2K, or global financial collapse in 2008, or the maximally alarmist climate people, or nuclear winter, or peak oil. for basically all of them, the right action empirically in retrospect was to not think too much about it. there are many concrete instances of people saying “but this is different” and then getting burned.
and if you allow any reference class to be on as strong a footing as very well established reference classes, then you open yourself up to getting pwned ideologically. “all complex intricate objects we have seen created have been created by something intelligent, therefore the universe must also have an intelligent creator.” it’s a very important memetic defense mechanism.
(to be clear this doesn’t mean you can only believe things others believe, or that humans taking over earth is not important evidence, or that doomsday is impossible!! I personally think AGI will probably kill everyone. but this is a big claim and should be treated as such. if we don’t accept this, then we will forever fail to communicate with people who don’t already agree with us on AGI x-risk.)
The reference classes you should use work as a heuristic because there is some underlying mechanism that makes them work. So you should use reference classes in situations where their underlying mechanism is expected to work.
Maybe the underlying mechanism of doomsday predictions not working is that people predicting doom don’t make their predictions based on valid reasoning. So if someone uses that reference class to doubt AI risk, this should be judged as them making a claim about reasoning of people predicting AI doom being similar to people in cults predicting Armageddon.
for basically all of them, the right action empirically in retrospect was to not think too much about it.
False?
Climate change tail scenarios are worth studying and averting. Nuclear winter was obviously worth studying and averting back in the Cold War, and still is today. 2008 financial crisis was worth studying and averting.
Do you not believe average citizens can study issues like these and make moves to solve them?
This is kind of missing the point of Bayes. One shouldn’t “choose” a reference class to update on. One should update to the best of your ability on the whole distribution of hypotheses available to describe the situation. Neither is a ‘right’ or ‘wrong’ reference class to use, they’re both just valid pieces of evidence about base rates, and you should probably be using both of them.
It seems you are having in mind something like inference to the best explanation here. Bayesian updating, on the other hand, does need a prior distribution, and the question of which prior distribution to use cannot be waved away when there is a disagreement on how to update. In fact, that’s one of the main problems of Bayesian updating, and the reason why it is often not used in arguments.
I’m not really sure what that has to do with my comment. My point is the original post seemed to be operating as if you look for the argmax reference class, you start there, and then you allow arguments. My point isn’t that their prior is wrong, it’s that this whole operation is wrong.
I think also you’re maybe assuming I’m saying the prior looks something like {reference class A, reference class B} and arguing about the relative probability of each, but it doesn’t, a prior should be over all valid explanations of the prior evidence. Reference classes come in because they’re evidence about base rates of particular causal structures; you can say ‘given the propensity for the world to look this way, how should I be correcting the probability of the hypotheses under consideration? Which new hypotheses should I be explicitly tracking?’
I can see where the original post might have gone astray. People have limits on what they can think about and it’s normal to narrow one’s consideration to the top most likely hypothesis. But it’s important to be aware of what you’re approximating here, else you get into a confusion where you have two valid reference classes and you start telling people that there’s a correct one to start arguing from.
I agree this is an interesting philosophical question but again I’m not sure why you’re bringing it up.
Given your link maybe you think me mentioning Bayes was referring to some method of selecting a single final hypothesis? I’m not, I’m using it to refer to the Bayesian update rule.
It seems the updating rule doesn’t tell you anything about the original argument even when you view information about reference classes as evidence rather than as a method of assigning prior probabilities to hypotheses. Or does it? Can you rephrase the argument in a proper Bayesian way such that it becomes clearer? Note that how strongly some evidence confirms or disconfirms a hypothesis also depends on a prior.
What argument are you referring to when you say “doesn’t tell you anything about the original argument”?
My framing is basically this: you generally don’t start a conversation with someone as a blank pre-priors slate that you get to inject your priors into. The prior is what you get handed, and then the question is how people should respond to the evidence and arguments available. Well, you should use (read: approximate) the basic Bayesian update rule: hypotheses where an observation is unlikely are that much less probable.
I think you’re underestimating the inferential gap here. I’m not sure why you’d think the Bayes updating rule is meant to “tell you anything about” the original post. My claim was that the whole proposal about selecting reference classes was framed badly and you should just do (approximate) Bayes instead.
You’re having a conversation with someone. They believe certain things are more probable than other things. They mention a reference class: if you look at this grouping of claims, most of them are wrong. Then you consider the set of hypotheses: under each of them, how plausible is it given the noted tendency for this grouping of claims to be wrong? Some of them pass easily, eg. the hypothesis that this is just another such claim. Some of them less easily; they are either a modal part of this group and uncommon on base rate, or else nonmodal or not part of the group at all. You continue, with maybe a different reference class, or an observation about the scenario.
Hopefully this illustrates the point. Reference classes are just evidence about the world. There’s no special operation needed for them.
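To make the “reference classes are just evidence” framing concrete, here is a minimal toy Bayesian update (all numbers made up purely for illustration): every hypothesis gets reweighted by how likely the reference-class observation is under it, rather than one class being “chosen” as the prior.

```python
# Toy update: treat "this claim belongs to a group that is usually wrong"
# as evidence, not as the prior itself. All numbers are illustrative.
prior = {
    "another failed doomsday claim": 0.5,
    "claim tracks a real mechanism": 0.5,
}
# Likelihood of the reference-class observation under each hypothesis.
likelihood = {
    "another failed doomsday claim": 0.9,
    "claim tracks a real mechanism": 0.3,
}
unnormalised = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalised.values())
posterior = {h: p / total for h, p in unnormalised.items()}
print(posterior)  # both hypotheses survive; the base rate just shifts the weights
```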
I think the group of people “claiming the end of the world” in the case of AI x-risk is importantly more credentialed and reasonable-looking than most prior claims about the end of the world. From the reference class and general heuristics perspective that you’re talking about[1], I think how credible looking the people are is pretty important.
So, I think the reference class is more like claims of nuclear armageddon than cults. (Plausibly near maximally alarmist climate people are in a similar reference class.)
I agree this reference class is better, and implies a higher prior, but I think it’s reasonable for the prior over “arbitrary credentialed people warning about something” to still be relatively low in an absolute sense: lots of people have impressive-sounding credentials that are not actually good evidence of competence (consider: it’s basically a meme at this point that whenever you see a book where the author puts “PhD” after their name, they’re probably a grifter / their PhD was probably kinda bs), and also there is a real negativity bias where fearmongering is amplified by both legacy and social media. Also, for the purposes of understanding normal people, it’s useful to keep in mind that trust in credentials and institutions is not very high right now in the US among genpop.
I endeavor to look at how things work and describe them accurately. Similarly to how I try to describe how a piece of code works, or how to build a shed, I will try to accurately describe the consequences of large machine learning runs, which can include human extinction.
I personally think AGI will probably kill everyone. but this is a big claim and should be treated as such.
This isn’t how I think about things. Reality is what exists, and if a claim accurately describes reality, then I should not want to hold it to higher standards than claims that do not describe reality. I don’t think it’s a good epistemology to rank claims by “bigness” and then say that the big ones are less likely and need more evidence. On the contrary, I think it’s worth investing more in finding out if they’re right, and generally worth bringing them up to consideration with less evidence than for “small” claims.
on the other hand, everyone has personally experienced a dozen different doomsday predictions. whether that’s your local church or faraway cult warning about Armageddon, or Y2K, or global financial collapse in 2008, or the maximally alarmist climate people, or nuclear winter, or peak oil. for basically all of them, the right action empirically in retrospect was to not think too much about it.
I don’t have the experiences you’re describing. I don’t go to churches, I don’t visit cults, I was 3yrs old in the year 2000, I was 11 for the ’08 financial crash and having read about it as an adult I don’t recall extinction being a topic of discussion, I think I have heard of climate people saying that via alarmist news headlines but I have not had anyone personally try to convince me of this or even say that they believe it. I have heard it discussed for nuclear winter, yes, and I think nukes are quite scary and it was reasonable to consider, I did not dismiss it out of hand and wouldn’t use that heuristic. I don’t know what the oil thing is.
In other words, I don’t recall anyone seriously trying to convince me that the world was ending except in cases where they had good reason to believe it. In my life, when people try to warn me about big things, especially if they’ve given it serious thought, usually I’ve found it’s been worthwhile for me to consider it. (I like to think I am good at steering clear of scammers and cranks, so that I can trust the people in my life when they tell me things.)
The sense I get from this post is that, in it, you’re assuming everyone else in the world is constantly being assaulted with claims meant to scare and control them rather than people attempting to describe the world accurately. I agree there are forces doing that, but I think this post gives up all too quickly on there being other forces in the world that aren’t doing that that people can recognize and trust.
i am also trying to accurately describe reality. what i’m saying is, even from the perspective of someone smart and truth-seeking but who doesn’t know much about the object-level, it is very reasonable to use bigness of claim as a heuristic for how much evidence you need before you’re satisfied, and that if you don’t do this, you will be worse at finding the truth in practice. my guess is this applies even more so to the average person.
i think this is very analogous to occam’s razor / trust region optimization. clearly, we need to discount theories based on complexity because there are exponentially more complex theories compared to simple ones, many of which have no easily observable difference to the simpler ones, opening you up to being pwned. and empirically it seems a good heuristic to live life by. complex theories can still be true! but given two theories that both accurately describe reality, you want the simpler one. similarly, given two equally complex claims that accurately describe the evidence, you want the one that is less far fetched from your current understanding of the world / requires changing less of your worldview.
also, it doesn’t have to be something you literally personally experienced. it’s totally valid to read the wikipedia page on the branch davidians or whatever and feel slightly less inclined to take things that have similar vibes seriously, or even to absorb the vibe from your environs (your aversion to scammers and cranks surely did not come ex nihilo, right?)
for most of the examples i raised, i didn’t necessarily mean the claim was literally 100% human extinction, and i don’t think it matters that it wasn’t. first, because the important thing is the vibe of the claim (catastrophic) - since we’re talking about heuristics on how seriously to take things that you don’t have time to deep dive on, the rule has to be relatively cheap to implement. i think most people, even quite smart people, genuinely don’t feel much of an emotional difference between literal human extinction vs collapse of society vs half of people dying painfully, unless they first spend a half hour carefully thinking about the implications of extinction. (and even then depending on their values they may still not feel a huge difference)
also, it would be really bad if you could weasel your way out of a reference class that easily; it would be ripe for abuse by bad actors—“see, our weird sect of christianity claims that after armageddon, not only will all actual sinners’ souls be tortured forever, but that the devil will create every possible sinner’s soul to torture forever! this is actually fundamentally different from all existing christian theories, and it would be unfathomably worse, so it really shouldn’t be thought of as the same kind of claim”
even if most people are trying to describe the world accurately (which i think is not true and we only get this impression because we live in a strange bubble of very truth seeking people + are above-average capable at understanding things object level and therefore quickly detecting scams), ideas are still selected for memeticness. i’m sure that 90% of conspiracy theorists genuinely believe that humanity is controlled by lizards and are trying their best to spread what they believe to be true. many (not all) of the worst atrocities in history have been committed by people who genuinely thought they were on the side of truth and good.
(actually, i think people do get pwned all the time, even in our circles. rationalists are probably more likely than average (controlling for intelligence) to get sucked into obviously culty things (e.g zizians), largely because they don’t have the memetic antibodies needed to not get pwned, for one reason or another. so probably many rationalists would benefit from evaluating things a little bit more on vibes/bigness and a little bit less on object level)
Your points about Occam’s razor have got nothing to do with this subject[1]. The heuristic “be more skeptical of claims that would have big implications if true” makes sense only when you suspect a claim may have been adversarially optimized for memetic fitness; it is not otherwise true that “a claim that something really bad is going to happen is fundamentally less likely to be true than other claims”.
I’m having a little trouble connecting your various points back to your opening paragraph, which is the primary thing that I am trying to push back on.[2]
I claim it is a lot more reasonable to use the reference class of “people claiming the end of the world” than “more powerful intelligences emerging and competing with less intelligent beings” when thinking about AI x-risk. further, we should not try to convince people to adopt the latter reference class—this sets off alarm bells, and rightly so (as I will argue in short order) - but rather to bite the bullet, start from the former reference class, and provide arguments and evidence for why this case is different from all the other cases.
To restate the message I’m reading here: “Give up on having a conversation where you evaluate the evidence alongside your interlocutors. Instead frame yourself as trying to convince them of something, and assume that they are correct to treat your communications as though you are adversarially optimizing for them believing whatever you want them to believe.” This assumption seems to give up a lot of my ability to communicate with people (almost ~all of it), and I refuse to simply do it because some amount of communication in the world is adversarially optimized, and I’m definitely not going to do it because of a spurious argument that Occam’s razor implies that “claims about things being really bad or claims that imply you need to take action are fundamentally less likely to be true”.
You are often in an environment where people are trying to use language to describe reality, and in that situation the primary thing to evaluate is not the “bigness” of a claim, but the evidence for and against it. I recommend instead to act in such a way as to increase the size and occurrence of that environment more-so than “act as though it’s correct to expect maximum adversarial optimization in communications”.
(Meta: The only literal quotes of Leo’s in this comment are the big one in the quote block, my use of “” is to hold a sentence as object, they are not things Leo wrote.)
I agree that the more strongly a claim implies that you should take action, then the more you should consider that it is being optimized adversarially for you to take action. For what it’s worth, I think that heuristic applies more so to claims that you should personally take action. Most people have little action to directly prevent the end of the world from AI; this is a heuristic more naturally applied to claims that you need to pay fines (which are often scams/spam). But mostly, when people give me claims that imply action, they are honestly meant claims and I do the action. This is the vast majority of my experience.
Aside to Leo: Rather than reply point-by-point to each of the paragraphs in the second comment, I will try restating and responding to the core message I got in the opening paragraph of the first comment. I’m doing this because the paragraphs in the second comment seemed somewhat distantly related / I couldn’t tell whether the points were actually cruxy. They were responding to many different things, and I hope restating the core thing will better respond to your core point. However, I don’t mean to avoid key arguments; if you think I have done so, feel free to tell me one or two paragraphs you would especially like me to engage with and I will do so in any future reply.
in practice many of the claims you hear will be optimized for memetic fitness, even if the people making the claims are genuine. well intentioned people can still be naive, or have blind spots, or be ideologically captured.
also, presumably the people you are trying to convince are on average less surrounded by truth seeking people than you are (because being in the alignment community is strongly correlated with caring about seeking truth).
i don’t think this gives up your ability to communicate with people. you simply have to signal in some credible way that you are not only well intentioned but also not merely the carrier of some very memetic idea that slipped past your antibodies. there are many ways to accomplish this. for example, you can build up a reputation of being very scrupulous and unmindkilled. this lets you convey ideas freely to other people in your circles that are also very scrupulous and unmindkilled. when interacting with people outside this circle, for whom this form of reputation is illegible, you need to find something else. depending on who you’re talking to and what kinds of things they take seriously, this could be leaning on the credibility of someone like geoff hinton, or of sam/demis/dario, or the UK government, or whatever.
this might already be what you’re doing, in which case there’s no disagreement between us.
You’re writing lots of things here but as far as I can tell you aren’t defending your opening statement, which I believe is mistaken.
I claim it is a lot more reasonable to use the reference class of “people claiming the end of the world” than “more powerful intelligences emerging and competing with less intelligent beings” when thinking about AI x-risk. further, we should not try to convince people to adopt the latter reference class—this sets off alarm bells, and rightly so (as I will argue in short order) - but rather to bite the bullet, start from the former reference class, and provide arguments and evidence for why this case is different from all the other cases.
Firstly, it’s just not more reasonable. When you ask yourself “Is a machine learning run going to lead to human extinction?” you should not first say “How trustworthy are people who have historically claimed the world is ending?”; you should of course primarily bring your attention to questions about what sort of machine is being built, what sort of thinking capacities it has, what sorts of actions it can take in the world, what sorts of optimization it runs, how it would behave around humans if it were more powerful than them, and so on. We can go back to discussing epistemology 101 if need be (e.g. “Hug the Query!”).
Secondly, insofar as someone believes you are a huckster or a crackpot, you should leave the conversation, communication here has broken down and you should look for other communication opportunities. However, insofar as someone is only evaluating this tentatively as one of many possible hypotheses about you then you should open yourself up to auditing / questioning by them about why you believe what you believe and your past history and your memetic influences. Being frank is the only way through this! But you shouldn’t say to them “Actually, I think you should treat me like a huckster/scammer/serf-of-a-corrupt-empire.” This feels analogous to a man on a date with a woman saying “Actually I think you should strongly privilege the hypothesis that I am willing to rape you, and now I’ll try to provide evidence for you that this is not true.” It would be genuinely a bad sign about a man that he thinks that about himself, and also he has moved the situation into a much more adversarial frame.
I suspect you could write some more narrow quick-take such as “Here is some communication advice I find helpful when talking with friends and colleagues about how AI can lead to human extinction”, but in generalizing it all the way to making dictates about basic epistemology you are making basic mistakes and getting it wrong.
Please either (1) defend and/or clarify the original statement, or (2) concede that it was mistaken, rather than writing more semi-related paragraphs about memetic immune systems.
Firstly, it’s just not more reasonable. When you ask yourself “Is a machine learning run going to lead to human extinction?” you should not first say “How trustworthy are people who have historically claimed the world is ending?”
But you should absolutely ask “does it look like I’m making the same mistakes they did, and how would I notice if it were so?” Sometimes one is indeed in a cult with your methods of reason subverted, or having a psychotic break, or captured by a content filter that hides the counterevidence, or many of the more mundane and pervasive failures in kind.
I am confused why you think my claims are only semi related. to me my claim is very straightforward, and the things i’m saying are straightforwardly converying a world model that seems to me to explain why i believe my claim. i’m trying to explain in good faith, not trying to say random things. i’m claiming a theory of how people parse information, to justify my opening statement, which i can clarify as:
sometimes, people use the rhetorical move of saying something like “people think 95% doom is overconfident, yet 5% isn’t. but that’s also being 95% confident in not-doom, and yet they don’t consider that overconfident. curious.” followed by “well actually, it’s only a big claim under your reference class. under mine, i.e. the set of all instances of a more intelligent thing emerging, actually, 95% doom is less overconfident than 5% doom.” this post was inspired by seeing one such tweet, but i see claims like this that play reference class tennis every once in a while.
i think this kind of argument is really bad at persuading people who don’t already agree (from empirical observation). my opening statement is saying “please stop doing this, if you do it, and thank you for not doing this, if you don’t already do it.” the rest of my paragraphs provide an explanation of my theory for why this is bad for changing people’s minds. this seems pretty obviously relevant for justifying why we should stop doing the thing. i sometimes see people out there talk like this (including my past self at some point), and then fail to convince people, and then feel very confused about why people don’t see the error of their ways when presented with an alternative reference class. if my theory is correct (maybe it isn’t, this isn’t a super well thought out take, it’s more a shower thought), then it would explain this, and people who are failing to convince people would probably want to know why they’re failing. i did not spell this out in my opening statement because i thought it was clear, but in retrospect it was not.
i don’t think the root cause is people being irrational epistemically. i think there is a fundamental reason why people do this that is very reasonable. i think you disagree with this on the object level and many of my paragraphs are attempting to respond to what i view as the reason you disagree. this does not explicitly show up in the opening statement, but since you disagree with this, i thought it would make sense to respond to that too
i am not saying you should explicitly say “yeah i think you should treat me as a scammer until i prove otherwise”! i am also not saying you should try to argue with people who have already stopped listening to you because they think you’re a scammer! i am merely saying we should be aware that people might be entertaining that as a hypothesis, and if you try to argue by using this particular class of rhetorical move, you will only trigger their defenses further, and that you should instead just directly provide the evidence for why you should be taken seriously, in a socially appropriate manner. if i understand correctly, i think the thing you are saying one should do is the same as the thing i’m saying one should do, but phrased in a different way; i’m saying not to do a thing that you seem to already not be doing.
i think i have not communicated myself well in this conversation, and my mental model is that we aren’t really making progress, and therefore this conversation has not brought value and joy into the world in the way i intended. so this will probably be my last reply, unless you think doing so would be a grave error.
The heuristic “be more skeptical of claims that would have big implications if true” makes sense only when you suspect a claim may have been adversarially optimized for memetic fitness; it is not otherwise true that “a claim that something really bad is going to happen is fundamentally less likely to be true than other claims”.
This seems wrong to me.
a. More smaller things happen and there are fewer kinds of smaller thing that happen. b. I bet people genuinely have more evidence for small claims they state than big ones on average. c. The skepticism you should have because particular claims are frequently adversarially generated shouldn’t first depend on deciding to be skeptical about it.
If you’ll forgive the lack of charity, ISTM that leogao is making IMO largely true points about the reference class and then doing the wrong thing with those points, and you’re reacting to the thing being done wrong at the end, but trying to do this in part by disagreeing with the points being made about the reference class. leogao is right that people are reasonable in being skeptical of this class of claims on priors, and right that when communicating with someone it’s often best to start within their framing. You are right that regardless it’s still correct to evaluate the sum of evidence for and against a proposition, and that other people failing to communicate honestly in this reference class doesn’t mean we ought to throw out or stop contributing to the good faith conversations available to us.
a. I expect there is a slightly more complicated relationship between my value-function and the likely configuration states of the universe than literally zero-correlation, but most configuration states do not support life and we are all dead, so in one sense a claim that in the future something very big and bad will happen is far more likely on priors. One might counter that we live in a highly optimized society where things being functional and maintained is an equilibrium state and it’s unlikely for systems to get out of whack enough for bad things to happen. But taking this straightforwardly is extremely naive, tons of bad things happen all the time to people. I’m not sure whether to focus on ‘big’ or ‘bad’ but either way, the human sense of these is not what the physical universe is made out of or cares about, and so this looks like an unproductive heuristic to me.
b. On the other hand, I suspect the bigger claims are more worth investing time to find out if they’re true! All of this seems too coarse-grained to produce a strong baseline belief about big claims or small claims.
c. I don’t get this one. I’m pretty sure I said that if you believe that you’re in a highly adversarial epistemic environment, then you should become more distrusting of evidence about memetically fit claims.
I don’t know what true points you think Leo is making about “the reference class”, nor which points you think I’m inaccurately pushing back on that are true about “the reference class” but not true of me. Going with the standard rationalist advice, I encourage everyone to taboo “reference class” and replace it with a specific heuristic. It seems to me that “reference class” is pretending that these groupings are more well-defined than they are.
c. I don’t get this one. I’m pretty sure I said that if you believe that you’re in a highly adversarial epistemic environment, then you should become more distrusting of evidence about memetically fit claims.
Well, sure, it’s just you seemed to frame this as a binary on/off thing, sometimes you’re exposed and need to count it and sometimes you’re not, whereas to me it’s basically never implausible that a belief has been exposed to selection pressures, and the question is of probabilities and degrees.
i’m not even saying people should not evaluate evidence for and against a proposition in general! it’s just that this is expensive, and so it is perfectly reasonable to have heuristics to decide which things to evaluate, and so you should first prove with costly signals that you are not pwning them, and then they can weigh the evidence. and until you can provide enough evidence that you’re not pwning them for it to be worth their time to evaluate your claims in detail, it should not be surprising that many people won’t listen to the evidence; and even if they do listen, if there is still lingering suspicion that they are being pwned, you need to provide the type of evidence that could persuade someone that they aren’t getting pwned (for which being credibly very honest and truth seeking is necessary but not sufficient), which is sometimes different from mere compellingness of argument
I think the framing that sits better to me is ‘You should meet people where they’re at.’ If they seem like they need confidence that you’re arguing from a place of reason, that’s probably indeed the place to start.
I think you’re correct. There’s a synergistic feedback loop between alarmism and social interaction that filters out pragmatic perspectives, creating the illusion that the doom surrounding any given topic is more prevalent than it really is, or even that it’s near universal.
Even before the rise of digital information, the feedback phenomenon could be observed in any insular group. In today’s environment, where a lot of effort goes into exploiting that feedback loop, it requires a conscious effort to maintain perspective, or even to remain aware that there are other perspectives.
# AI and the Future of Personalized Education: A Paradigm Shift in Learning
Recently, I’ve been exploring the theory of computation. With the rapid advancement of artificial intelligence—essentially a vast collection of algorithms and computational instructions designed to process inputs and generate outputs—I find myself increasingly curious about the fundamental capabilities and limitations of computation itself. Concepts such as automata, Turing machines, computability, and complexity frequently appear in discussions about AI, yet my understanding of these topics is still developing. I recently encountered fascinating articles by Stephen Wolfram, including [Observer Theory](https://writings.stephenwolfram.com/2023/12/observer-theory/) and [A New Kind of Science: A 15-Year View](https://writings.stephenwolfram.com/2017/05/a-new-kind-of-science-a-15-year-view/). Wolfram presents intriguing ideas, such as the claim that beyond a certain minimal threshold, nearly all processes—natural or artificial—are computationally equivalent in sophistication, and that even the simplest rules (like cellular automaton Rule 30) can produce irreducible, unpredictable complexity.
Before the advent of AI tools, my approach to learning involved selecting a relevant book, reading through it, and working diligently on exercises. A significant challenge in self-directed learning is the absence of immediate guidance when encountering difficulties. To overcome this, I would synthesize information from various sources—books, online resources, and Q&A platforms like Stack Overflow—to clarify my doubts. Although rewarding, as it encourages the brain to form connections and build new knowledge, this process is undeniably time-consuming. Imagine if we could directly converse with the author of a textbook—transforming the author into our personal teacher would greatly enhance learning efficiency.
In my view, an effective teacher should possess the following qualities:
- Expertise in the subject matter, with a depth of knowledge significantly greater than that of the student, and familiarity with related disciplines to provide a comprehensive understanding.
- A Socratic teaching style, where the teacher guides students through questions, encourages active participation, corrects misconceptions, and provides constructive feedback. The emphasis should be on the learning process rather than merely arriving at the correct answer.
- An ability to recognize and address the student’s specific misunderstandings, adapting teaching methods to suit the student’s individual learning style and level.
Realistically, not all teachers I’ve encountered meet these criteria. Good teachers are scarce resources, which explains why parents invest heavily in quality education and why developed countries typically have more qualified teachers than developing ones.
With the emergence of AI tools, I sense a potential paradigm shift in education. Rather than simply asking AI to solve problems, we can leverage AI as a personalized teacher. For undergraduate-level topics, AI already surpasses the average classroom instructor in terms of breadth and depth of knowledge. AI systems effectively function as encyclopedias, capable of addressing questions beyond the scope of typical educators. Moreover, AI can be easily adapted to employ a Socratic teaching approach. However, current AI still lacks the nuanced ability to fully understand a student’s individual learning style and level. It relies heavily on the learner’s self-awareness and reflection to identify gaps in understanding and logic, prompting the learner to seek clarification. This limitation likely arises because large language models (LLMs) are primarily trained to respond to human prompts rather than proactively prompting humans to think critically.
Considering how AI might reshape education, I offer the following informal predictions:
- AI systems will increasingly be trained specifically as teachers, designed to prompt learners through Socratic questioning rather than simply providing direct answers. A significant challenge will be creating suitable training environments and sourcing data that accurately reflect the learning process. Potential training resources could include textbooks, Q&A platforms like Stack Overflow and Quora, and educational videos from Khan Academy and MIT OpenCourseWare.
- AI-generated educational content will become dynamic and personalized, moving beyond traditional chatbot interactions. Similar to human teachers, AI might illustrate concepts through whiteboard explanations, diagrams, or even programming demonstrations. Outputs could include text, images, videos, or interactive web-based experiences.
- The number of AI teachers will vastly exceed the number of human teachers, significantly reducing the cost of education. This transformation may occur before 2028, aligning with predictions outlined in [AI-2027](https://ai-2027.com/).
In a hypothetical future where AI can perform every cognitive task, will humans still need to learn? Will we still require teachers? If AI remains friendly and supportive, I believe human curiosity will persist, though the necessity for traditional learning may diminish significantly. Humans might even use AI to better understand AI itself. Conversely, if AI were to become adversarial, perhaps humans would still have roles to fulfill, necessitating AI to teach humans the skills required for these tasks.
Wolfram presents intriguing ideas, such as the claim that beyond a certain minimal threshold, nearly all processes—natural or artificial—are computationally equivalent in sophistication
In my view, an effective teacher should possess the following qualities: [...] Realistically, not all teachers I’ve encountered meet these criteria. Good teachers are scarce resources
As a former teacher, I 100% agree.
why developed countries typically have more qualified teachers than developing ones.
I am not sure I understand this part. How specifically does being a developed country increase the number of teachers able to do the Socratic method etc.? (I could make a guess, but I am interested in your interpretation.)
In 2021, Daniel Ellsberg leaked US govt plans from 1958 to make a nuclear first strike on China over the Taiwan conflict.
Daniel Ellsberg copied these papers more than 50 years ago but only released them now because he thought another conflict over Taiwan may be possible soon.
As usual, seems clear Dulles was more interested in escalating the conflict (in this case, to nuclear) than Eisenhower.
On September 2, Dulles met with members of the Joint Chiefs and other top officials to formulate the basic American position in the crisis and to define American policy in the event of a Chinese Communist invasion of the Offshore Islands. At this meeting there was considerable debate on the question of to what extent Quemoy could be defended without nuclear weapons and on the more general question of the wisdom of relying on nuclear weapons for deterrence. The consensus reached was that the use of nuclear weapons would ultimately be necessary for the defense of Quemoy, but that the United States should limit itself initially to using conventional forces
One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding “flips” from representing the original token to finally representing the prediction of the next token.
By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:
- some 1000 dimensions encode the original token
- some other 1000 dimensions encode the prediction of the next token
- the remaining 10,288 dimensions encode information about all available context (which will start out “empty” and get filled with meaningful information through the layers).
In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there’s the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it’s still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.
Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just “translate” from token space to embedding space and vice versa[1]. This made sense in relation to the initial & naive “embeddings represent tokens” interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an “extraction” of the information content in the embedding that encodes the prediction.
One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that “Zero layer transformers model bigram statistics”. So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I’m not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
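To make the zero-layer point concrete, here is a minimal sketch (the toy sizes and the untied unembedding matrix are my own assumptions, not GPT-3’s actual configuration): with every layer skipped, the logits for the next token depend only on the current token, via the fixed table W_E @ W_U, which is exactly a learned bigram table at best.

```python
# Minimal sketch of a "zero-layer transformer": embed, then immediately unembed.
# Toy sizes; W_E and W_U are untied here (tied embeddings would mean W_U = W_E.T).
import torch

vocab_size, d_model = 1000, 64
W_E = torch.randn(vocab_size, d_model)  # embedding matrix
W_U = torch.randn(d_model, vocab_size)  # unembedding matrix

def zero_layer_logits(token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token logits per position with every attention/MLP layer skipped."""
    x = W_E[token_ids]   # (seq_len, d_model): each position sees only its own token
    return x @ W_U       # (seq_len, vocab_size): a lookup into W_E @ W_U,
                         # i.e. at best learned bigram statistics

print(zero_layer_logits(torch.tensor([3, 17, 42])).shape)  # torch.Size([3, 1000])
```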
I would guess that transformer-experienced people (unless they disagree with my description—in that case, please elaborate what I’m still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.
Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word “Good” and then unembed the embedding immediately, you would get a very high probability for “Good” back when in practice (I didn’t verify this yet) you would probably obtain high probabilities for “morning”, “day” etc.
Awkwardly, it depends on whether the model uses tied embeddings (unembed is embed transpose) or has separate embed and unembed matrices. Using tied embedding matrices like this means the model actually does have to do a sort of conversion.
Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don’t think the initial state is like “1k encode current, 1k encode predictions, rest start empty”. The model can just directly encode predictions for an initial state using the unembed.
They have the same dosage as a cup of coffee (~100mg).
You can still drink coffee/Diet Coke/tea, just get it without caffeine. Coke caffeine-free, decaf coffee, herbal tea.
They cost ~60¢ per pill [EDIT: oops, it’s 6¢ per pill — thanks @ryan_greenblatt] vs ~$5 for a cup of coffee — that’s about an order of magnitude cheaper.
You can put them in your backpack or back pocket or car. They don’t go bad, they’re portable, they won’t spill on your clothes, they won’t get cold.
Straight caffeine makes me anxious. L-Theanine makes me less anxious. The caffeine capsules I linked above have equal parts caffeine and L-Theanine.
Also:
Caffeine is a highly addictive drug; you should treat it like one. Sipping a nice hot beverage doesn’t make me feel like I’m taking a stimulant in the way that swallowing a pill does.
I don’t know how many milligrams of caffeine were in the last coffee I drank. But I do know exactly the amount of caffeine in every caffeine pill I’ve ever taken. Taking caffeine pills prevents accidentally consuming way too much (or too little) caffeine.
I don’t want to associate “caffeine” with “tasty sugary sweet drink,” for two reasons:
A lot of caffeinated beverages contain other bad stuff. You might not by-default drink a sugary soft drink if it weren’t for the caffeine, so disambiguating the associations in your head might cause you to eat your caffeine and not drink the soda.
Operant conditioning works by giving positive reinforcement to certain behaviors, causing them to happen more frequently. Like, for instance, giving someone a sugary soft drink every time they take caffeine. But when I take caffeine, I want to be taking it because of a reasoned decision-making process minimally swayed by factors not under my control. So I avoid giving my brain a strong positive association with something that happens every time it experiences caffeine (e.g. a sugary soft drink). Caffeine is addictive enough! Why should I make the Skinner box stronger?
If you can’t take pills, consider getting caffeine patches — though I’ve never tried them, so can’t give it my personal recommendation.
Caffeine is a drug. I’m not a doctor, take caffeine at your own risk, this is not medical advice.
This post does not take a stance on whether or not you should take caffeine; the stance that it takes is, conditional on your already having decided to take caffeine, you should take it in pill form (instead of in drink form).
~$5 for a cup of coffee — that’s about an order of magnitude cheaper.
Are you buying your coffee from a cafe every day or something? You can buy a pack of nice grounds for like $13, and that lasts more than a month (126 Tbsp/pack / (3 Tbsp/day) = 42 days/pack), totaling 30¢/day. Half the cost of a caffeine pill. And that’s if you don’t buy bulk.
Are you buying your coffee from a cafe every day or something?
i’m not (i don’t buy caffeinated drinks!), but the people i’m responding to in this post are. in particular, i often notice people go from “i need caffeine” → “i’ll buy a {coffee, tea, energy drink, etc}” — for example, college students, most of whom don’t have the wherewithal to go to the effort of making their own coffee.
One question I’m curious about: do these pills have less of an effect (or none) on your bowels compared to a cup of coffee? Is it something about the caffeine itself, something else in the drink, or the mode of absorption? If the pills avoid those effects, then I’m genuinely interested.
If you take caffeine regularly, I also recommend experimenting with tolerance build-up, which the pill form makes easy. You want to figure out the minimal number of days N such that if you don’t take caffeine every N days, you don’t develop tolerance. For me, N turned out to be equal to 2: if I take 100 mg of caffeine every second day, it always seems to have its full effect (or tolerance develops very slowly; and you can “reset” any such slow creep-up by quitting caffeine for e. g. 1 week every 3 months).
You can test that by taking 200 mg at once[1] after 1-2 weeks of following a given intake schedule. If you end up having a strong reaction (jitteriness, etc., it’s pretty obvious, at least in my experience), you haven’t developed tolerance. If the reaction is only about as strong as taking 100 mg on a toleranceless stomach[2], then you have.
(Obviously the real effects are probably not so neatly linear, and it might work for you differently. But I think the overarching idea of testing caffeine tolerance build-up by monitoring whether the rather obvious “too much caffeine” point moved up or not, is an approach with a much better signal/noise ratio than doing so via e. g. confounded cognitive tests.)
Once you’ve established that, you can try more complicated schemes. E. g., taking 100 mg on even days and 200 mg on odd days. Some caffeine effects are plausibly not destroyed by tolerance, so this schedule lets you reap those every day, and have full caffeine effects every second day. (Again, you can test for nonlinear tolerance build-up effects by following this schedule for 1-2 weeks, then taking a larger dose of 300-400 mg[3], and seeing where its effect lies on the “100 mg on a toleranceless stomach” to “way too much caffeine” spectrum.)
You can establish that baseline by stopping caffeine intake for 3 weeks, then taking a single 100 mg dose. You probably want to do that anyway for the N-day experimentation.
Counterargument: sure, good decaf coffee exists, but it’s harder to get hold of. Because it’s less popular, the decaf beans at cafés are often less fresh or from a worse supplier. Some places don’t stock decaf coffee. So if you like the taste of good coffee, taking caffeine pills may limit the amount of good coffee you can access and drink without exceeding your desired dose.
As a black tea enjoyer, I would argue it’s practically nonexistent; no decaf black tea I’ve ever tried even comes close to the best “normal” black teas.
This is true of all teas. The decaf ones all are terrible. I spent a while trying them in the hopes of cutting down my caffeine consumption, but the taste compromise is severe. And I’d say that the black decaf teas were the best I tried, mostly because they tend to have much more flavor & flavorings, so there was more left over from the water or CO2 decaffeination...
there are plenty of other common stimulants, but caffeine is by far the most commonly used — and also the most likely to be taken mixed into a tasty drink, rather than in a pill.
Beware mistaking a “because” for an “and”. Sometimes you think something is X and Y, but it turns out to be X because Y.
For instance, I was recently at a metal concert, and helped someone off the ground in a mosh pit. Someone thanked me afterwards but to me it seemed like the most obvious thing in the world.
A mosh pit is not fun AND a place where everyone helps each other. It is fun BECAUSE everyone helps each other. Play-acting aggression while being supportive is where the fun is born.
If you don’t believe in your work, consider looking for other options
I spent 15 months working for ARC Theory. I recently wrote up why I don’t believe in their research. If one reads my posts, I think it should become very clear to the reader that either ARC’s research direction is fundamentally unsound, or I’m still misunderstanding some of the very basics after more than a year of trying to grasp it. In either case, I think it’s pretty clear that it was not productive for me to work there. Throughout writing my posts, I felt an intense shame imagining readers asking the very fair question: “If you think the agenda is so doomed, why did you keep working on it?”[1]
In my first post, I write: “Unfortunately, by the time I left ARC, I became very skeptical of the viability of their agenda.” This is not quite true. I was very skeptical from the beginning, for largely similar reasons I expressed in my posts. But first I told myself that I should stay a little longer. Either they manage to convince me that the agenda is sound, or I demonstrate that it doesn’t work, in which case I free up the labor of the group of smart people working on the agenda. I think this was initially a somewhat reasonable position, though it was already in large part motivated reasoning.
But half a year after joining, I don’t think this theory of change was very tenable anymore. It was becoming clear that our arguments were going in circles. I couldn’t convince Paul and Mark (the two people thinking the most about the big picture questions), nor could they convince me. Eight months in, two friends visited me in California, and they noticed that I always derailed the conversation when they asked me about my research. The fact that I was ashamed to talk about my research with my friends, because I was afraid they would see how crazy it was, should have been an important signal. I should have quit then, but I stayed for another seven months.
I think this was largely due to cowardice. I’m very bad at coding and all my previous attempts at upskilling in coding went badly.[2] I thought of my main skill as being a mathematician, and I wanted to keep working on AI safety. The few other places one can work as a mathematician in AI safety looked even less promising to me than ARC. I was afraid that if I quit, I wouldn’t find anything else to do.
In retrospect, this fear was unfounded. I realized there were other skills one can develop, not just coding. In my afternoons, I started reading a lot more papers and serious blog posts[3] from various branches of AI safety. After a few months, I felt I had much more context on many topics. I started to think more about what I could do with my non-mathematical skills. When I finally started applying for jobs, I got offers from the European AI Office and UKAISI, and it looked more likely than not that I would get an offer from Redwood.[4]
Other options I considered that looked less promising than the three above, but still better than staying at ARC:
- Team up with some Hungarian coder friends and execute some simple but interesting experiments I had vague plans for.[5]
- Assemble a good curriculum for the prosaic AI safety agendas that I like.
- Apply for a grant-maker job.
- Become a Joe Carlsmith-style general investigator.
- Try to become a journalist or an influential blogger.
- Work on crazy acausal trade stuff.
I still think many of these were good opportunities, and probably there are many others. Of course, different options are good for people with different skill profiles, but I really believe that the world is ripe with opportunities to be useful for people who are generally smart and reasonable and have enough context on AI safety. If you are working on AI safety but don’t really believe that your day-to-day job is going anywhere, remember that having context and being ingrained in the AI safety field is a great asset in itself,[6] and consider looking for other projects to work on.
(Important note: ARC was a very good workplace, my coworkers were very nice to me and receptive to my doubts, and I really enjoyed working there except for feeling guilty that my work is not useful. I’m also not accusing the people who continue working at ARC of being cowards in the way I have been. They just have a different assessment of ARC’s chances, or work on lower-level questions than I have, where it can be reasonable to just defer to others on the higher-level questions.)
(As an employee of the European AI Office, it’s important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
No, really, it felt very bad writing the posts. It felt like describing how I worked for a year on a scheme that was either trying to build perpetual motion machines, or trying to build normal cars while having somehow missed the fact that gasoline exists. Embarrassing either way.
How exactly are you measuring coding ability? What are the ways you’ve tried to upskill, and what are common failure modes? Can you describe your workflow at a high-level, or share a recording? Are you referring to competence at real world engineering tasks, or performance on screening tests?
If one reads my posts, I think it should become very clear to the reader that either ARC’s research direction is fundamentally unsound, or I’m still misunderstanding some of the very basics after more than a year of trying to grasp it.
I disagree. Instead, I think that either ARC’s research direction is fundamentally unsound, or you’re still misunderstanding some of the finer details after more than a year of trying to grasp it. Like, your post is a few layers deep in the argument tree, and the discussions we had about these details (e.g. in January) went even deeper. I don’t really have a position on whether your objections ultimately point at an insurmountable obstacle for ARC’s agenda, but if they do, I think one needs to really dig into the details in order to see that.
That’s not how I see it. I think the argument tree doesn’t go very deep until I lose the thread. Here are a few, slightly stylized but real, conversations I had with friends who had no context on what ARC was doing, when I tried to explain our research to them:
Me: We want to do Low Probability Estimation.
Them: Does this mean you want to estimate the probability that ChatGPT says a specific word after a 100 words on chain of thought? Isn’t this clearly impossible?
Me: No, you see, we only want to estimate the probabilities as well as the model knows them.
Them: What does this mean?
Me: [I can’t answer this question.]
Me: We want to do Mechanistic Anomaly Detection.
Them: Isn’t this clearly impossible? Won’t this result in a lot of false positives when anything out of distribution happens?
Me: Yes, which is why we have this new clever idea of relying on the fragility of sensor tampering: that if you delete a subset of the actions, you will get an inconsistent image.
Them: What if the AI builds another robot to tamper with the cameras?
Me: We actually don’t want to delete actions but rather heuristic arguments for why the cameras will show something, and we want to construct heuristic explanations in a way that they carry over through delegated actions.
Them: What does this mean?
Me: [I can’t answer this question.]
Me: We want to create Heuristic Arguments to explain everything the model does.
Them: What does it mean that an argument explained a behavior? What is even the type signature of heuristic arguments? And you want to explain everything a model does? Isn’t this clearly impossible?
Me: [I can’t answer this question.]
When I was explaining our research to outsiders (which I usually tried to avoid out of cowardice), we usually got to some of these points within minutes. So I wouldn’t say these are fine details of our agenda.
During my time at ARC, the majority of my time was spent asking Mark and Paul variations of these three questions. They always kindly answered, and the answer was convincing-sounding enough for the moment that I usually couldn’t really reply on the spot, and then I went back to my room to think through their answers. But I never actually understood their answers, and I can’t reproduce them now. Really, I think that was the majority of the work I did at ARC. When I left, you guys should have bought a rock with “Isn’t this clearly impossible?” written on it, and it would have profitably replaced my presence.
That’s why I’m saying that either ARC’s agenda is fundamentally unsound or I’m still missing some of the basics. What stands between ARC’s agenda and collapse under five minutes of questioning from an outsider is that Paul and Mark (and maybe others on the team) have some convincing-sounding answers to the three questions above. So I would say that these answers are really part of the basics, and I never understood them.
Maybe Mark will show up in the comments now to give answers to the three questions, and I expect the answers to sound kind of convincing, and I won’t have a very convincing counter-argument other than some rambling reply saying essentially that “I think this argument is missing the point and doesn’t actually answer the question, but I can’t really point out why, because I don’t actually understand the argument because I don’t understand how you imagine heuristic arguments”. (This is what happened in the comments on my other post, and thanks to Mark for the reply and I’m sorry for still not understanding it.) I can’t distinguish whether I’m just bad at understanding some sound arguments here, or the arguments are elaborate self-delusions of people who are smarter and better at arguments than me. In any case, I feel epistemic learned helplessness on some of these most basic questions in ARC’s agenda.
What is your opinion on the Low Probability Estimation paper published this year at ICLR?
I don’t have a background in the field, but it seems like they were able to get some results indicating that the approach can extract something. https://arxiv.org/pdf/2410.13211
It’s a nice paper, and I’m glad they did the research, but importantly, the paper reports a negative result about our agenda. The main result is that the method inspired by our ideas under-performs the baseline. Of course, these are just the first experiments, work is ongoing, this is not conclusive negative evidence for anything. But the paper certainly shouldn’t be counted as positive evidence for ARC’s ideas.
I was very skeptical from the beginning, for largely similar reasons I expressed in my posts. But first I told myself that I should stay a little longer.
IME, in the majority of cases, when I strongly felt like quitting but was also inclined to justify “staying just a little bit longer because XYZ”, and listened to my justifications, staying turned out to be the wrong decision.
Little is known about whether people make good choices when facing important decisions. This paper reports on a large-scale randomized field experiment in which research subjects having difficulty making a decision flipped a coin to help determine their choice. For important decisions (e.g. quitting a job or ending a relationship), those who make a change (regardless of the outcome of the coin toss) report being substantially happier two months and six months later. This correlation, however, need not reflect a causal impact. To assess causality, I use the outcome of a coin toss. Individuals who are told by the coin toss to make a change are much more likely to make a change and are happier six months later than those who were told by the coin to maintain the status quo. The results of this paper suggest that people may be excessively cautious when facing life-changing choices.
Pretty much the whole causal estimate comes down to the influence of happiness 6 months after quitting a job or breaking up. Almost everything else is swamped with noise. The only individual question with a consistent causal effect larger than the standard error was “should I break my bad habit?”, and doing so made people unhappier. Even for those factors, there’s a lot of biases in this self-report data, which the authors noted and tried to address. I’m just not sure what we can really learn from this, even though it is a fun study.
In 2027 the trend that began in 2024 with OpenAI’s o1 reasoning system has continued. The cost of the compute required to run AI is no longer negligible compared to the cost of training it. Models reason over long periods of time. Their effective context windows are massive, they update their underlying models continuously, and they break tasks down into sub-tasks to be carried out in parallel. The base LLM they are built on is two generations ahead of GPT-4.
These systems are language model agents. They are built with self-understanding and can be configured for autonomy. These constitute proto-AGI. They are artificial intelligences that can perform much but not all of the intellectual work that humans can do (although even what these AI can do, they cannot necessarily do cheaper than a human could).
In 2029 people have spent over a year working hard to improve the scaffolding around proto-AGI to make it as useful as possible. Now the next generation of LLM foundation model is released. With some further improvements to the reasoning and learning scaffolding, this is true AGI. It can perform any intellectual task that a human could (although it’s very expensive to run at full capacity). It is better at AI research than any human. But it is not superintelligence. It is still controllable and its thoughts are still legible. So, it is put to work on AI safety research. Of course, by this point much progress has already been made on AI safety—but it seems prudent to get the AGI to look into the problem and get its go-ahead before commencing with the next training run. After a few months the AI declares it has found an acceptable safety approach. It spends some time on capabilities research then the training run for the next LLM begins.
In 2030 the next LLM is completed, and improved scaffolding is constructed. Now human-level AI is cheap, better-than-human-AI is not too expensive, and the peak capabilities of the AI are almost alien. For a brief period of time the value of human labour skyrockets, workers acting as puppets as the AI instructs them over video-call to do its bidding. This is necessary due to a major robotics shortfall. Human puppet-workers work in mines, refineries, smelters, and factories, as well as in logistics, optics, and general infrastructure. Human bottlenecks need to be addressed. This takes a few months, but the ensuing robotics explosion is rapid and massive.
2031 is the year of the robotics explosion. The robots are physically optimised for their specific tasks, coordinate perfectly with other robots, are able to sustain peak performance, do not require pay, and are controlled by cleverer-than-human minds. These are all multiplicative factors for the robots’ productivity relative to human workers. Most robots are not humanoid, but let’s say a humanoid robot would cost $x. Per $x of robots, the robots of 2031 are 10,000 times as productive as a human. This might sound like a ridiculously high number: one robot the equivalent of 10,000 humans? But let’s do some rough math:
| Advantage | Productivity multiplier (relative to skilled human) |
| --- | --- |
| Physically optimised for their specific tasks | 5 |
| Coordinate perfectly with other robots | 10 |
| Able to sustain peak performance | 5 |
| Do not require pay | 2 |
| Controlled by cleverer-than-human minds | 20 |

5 * 10 * 5 * 2 * 20 = 10,000
Suppose that a human can construct one robot per year (taking into account mining and all the intermediary logistics and manufacturing). With robots 10^4 times as productive as humans, each robot will construct an average of 10^4 robots per year. This is the robotics explosion. By the end of the year there will be 10^11 robots (more precisely, a number of robots that is cost-equivalent to 10^11 humanoid robots).
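As a rough sanity check of the scenario’s arithmetic, here is a minimal sketch; the multipliers and the end-of-year robot count are the scenario’s own assumptions, not established figures.

```python
# Back-of-the-envelope check of the scenario's numbers
# (all inputs are the scenario's assumptions, not measurements).
multipliers = {
    "physically optimised for specific tasks": 5,
    "coordinate perfectly with other robots": 10,
    "able to sustain peak performance": 5,
    "do not require pay": 2,
    "controlled by cleverer-than-human minds": 20,
}

productivity_per_robot = 1
for factor in multipliers.values():
    productivity_per_robot *= factor
print(productivity_per_robot)      # 10000 human-worker equivalents per robot

robots_end_of_2031 = 1e11          # scenario's assumed end-of-year robot count
total = robots_end_of_2031 * productivity_per_robot
print(f"{total:.0e}")              # 1e+15 skilled-human-worker equivalents
print(f"{total / 1e10:.0e}")       # ~1e+05 times humanity's ~10^10 workers
```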
By 2032 there are 10^11 robots, each with the productivity of 10^4 skilled human workers. That is a total productivity equivalent to 10^15 skilled human workers. This is roughly 10^5 times the productivity of humanity in 2024. At this point trillions of advanced processing units have been constructed and are online. Industry expands through the Solar System. The number of robots continues to balloon. The rate of research and development accelerates rapidly. Human mind upload is achieved.
It’s been 7 months since I wrote the comment above. Here’s an updated version.
It’s 2025 and we’re currently seeing the length of tasks AI can complete double every 4 months [0]. This won’t last forever [1]. But it will last long enough: well into 2026. There are twenty months from now until the end of 2026, so according to this pattern we can expect to see 5 doublings from the current time-horizon of 1.5 hours, which would get us to a time-horizon of 48 hours.
But we should actually expect even faster progress. This for two reasons:
(1) AI researcher productivity will be amplified by increasingly-capable AI [2]
(2) the difficulty of each subsequent doubling is less [3]
This second point is plain to see when we look at extreme cases: going from 1 minute to 10 minutes necessitates vast amounts of additional knowledge and skill; going from 1 year to 10 years, very little of either. The amount of progress required to go from 1.5 to 3 hours is much more than from 24 to 48 hours, so we should expect doublings to take less than 4 months in 2026, and instead of reaching just 48 hours we may reach, say, 200 hours.
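A minimal sketch of both projections, using only the numbers quoted above; the 12%-per-doubling speedup in the second call is my own placeholder for “doublings get easier”, not a figure from the text or from METR.

```python
# Sketch of the time-horizon projections above.
# ASSUMPTION: the shrink factor (each doubling takes 12% less time than the last)
# is a placeholder of mine, chosen only to illustrate how modest acceleration
# turns "48 hours" into roughly "200 hours" by the end of 2026.

def horizon_after(months, start_hours, first_doubling_months, shrink=1.0):
    """Task-length horizon reached after `months`, if each successive
    doubling takes `shrink` times as long as the previous one."""
    horizon, doubling_time, elapsed = start_hours, first_doubling_months, 0.0
    while elapsed + doubling_time <= months:
        elapsed += doubling_time
        horizon *= 2
        doubling_time *= shrink
    return horizon

print(horizon_after(20, 1.5, 4.0))               # constant 4-month doublings -> 48.0 hours
print(horizon_after(20, 1.5, 4.0, shrink=0.88))  # accelerating doublings     -> 192.0 hours
```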
200-hour time horizons entail agency: error-correction, creative problem solving, incremental improvement, scientific insight, and deeper self-knowledge will all be necessary to carry out these kinds of tasks.
So, by the end of 2026 we will have advanced AGI [4]. Knowledge work in general will be automated as human workers fail to compete on cost, knowledge, reasoning ability, and personability. The only knowledge workers remaining will be at the absolute frontiers of human knowledge. These knowledge workers, such as researchers at frontier AI labs, will have their productivity massively amplified by AI which can do the equivalent of hundreds of hours of skilled human programming, mathematics, etc. work in a fraction of that time.
The economy will not yet have been anywhere near fully-robotised (making enough robots takes time, as does the necessary algorithmic progress), so AI-directed manual labour will be in extremely high demand.
But the writing will be on the wall for all to see: full automation, extending into space industry and hyperhuman science, will be correctly seen as an inevitability, and AI company valuations will have increased by totally unprecedented amounts. Leading AI company market capitalisations could realistically measure in the quadrillions, and the S&P 500 in the millions [5].
In 2027 a robotics explosion ensues. Vast amounts of compute come online, space-industry gets started (humanity returns to the Moon). AI surpasses the best human AI researchers, and by the end of the year, AI models trained by superhuman AI come online, decoupled from risible human data corpora, capable of conceiving things humans are simply biologically incapable of understanding. As industry fully robotises, humans obsolesce as workers and spend their time instead in leisure and VR entertainment. Healthcare progresses in leaps and bounds and crime is under control—relatively few people die.
In 2028 mind-upload tech is developed, death is a thing of the past, psychology and science are solved. AI space industry swallows the solar system and speeds rapidly out toward its neighbors, as ASI initiates its plan to convert the nearby universe into computronium.
Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We’ve got to AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can’t do tasks that take professional humans 160 years? And it’s going to take 4 more doublings to get there? And these 4 doublings are going to take 2 more years to occur? No, at some point you “jump all the way” to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.
...
There just aren’t that many skills you need to operate for 10 days that you don’t also need to operate for 1 day, compared to how many skills you need to operate for 1 hour that you don’t also need to operate for 6 minutes.
[4] Here’s what I mean by “advanced AGI”:
By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed. Such systems may be modeled on the human brain, but they do not necessarily have to be, and they do not have to be “conscious” or possess any other competence that is not strictly relevant to their application. What matters is that such systems can be used to replace human brains in tasks ranging from organizing and running a mine or a factory to piloting an airplane, analyzing intelligence data or planning a battle.
Good epistemic calibration of a prediction source is not impressive.
I see people being impressed by calibration charts, for example https://x.com/ESYudkowsky/status/1924529456699641982 , or stronger: https://x.com/NathanpmYoung/status/1725563206561607847
But it’s trivial to have a straight-line calibration graph, if it’s not straight just fix it for each probability by repeatedly predicting a one-sided coin’s outcome as that probability.
If you’re a prediction market platform where the probability has to be decided by dumb monkeys, just make sure that the vast majority of questions are of the form “will my p-weighted coin land heads”.
---
If a calibration graph isn’t straight, that implies epistemic free lunch—if things that you predict at 20% actually happen 30% of the time, just shift those predictions. This is probably the reason why actual prediction markets are calibrated, since incalibration leads to an easy trading strategy. But the presence of calibration is not a very interesting property.
Calibration is a super important signal of quality because it means you can actually act on the given probabilities! Even if someone is gaming calibration by betting given ratios on certain outcomes, you can still bet on their predictions and not lose money (often). That is far better than other news sources such as tweets or NYT or whatever. If a calibrated predictor and a random other source are both talking about the same thing, the fact that the predictor is calibrated is enough to make them the #1 source on that topic.
Disagree. It’s possible to get a good calibration chart in unimpressive ways, but that’s not how Polymarket & Manifold got their calibration, so their calibration is impressive.
To elaborate: It’s possible to get a good calibration graph by only predicting “easy” questions (e.g. the p-weighted coin), or by predicting questions that are gameable if you ignore discernment (e.g. 1⁄32 for each team to win the Super Bowl), or with an iterative goodharting strategy (e.g. seeing that too many of your “20%” forecasts have happened so then predicting “20%” for some very unlikely things). But forecasting platforms haven’t been using these kinds of tricks, and aren’t designed to. They came by their calibration the hard way, while predicting a diverse set of substantive questions one at a time & aiming for discernment as well as calibration. That’s an accomplishment.
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
This is a late reply, but at least from this article, it seems like Ilya Sutskever was already losing confidence by mid-2023 that OpenAI would reach AGI. Additionally, if the rumors about GPT-5 are true, it’s mainly going to be a unification of existing models rather than something entirely new. Combined with the GPT-4.5 release, it sure seems like progress at OpenAI is slowing down rather than speeding up.
How do you know that researchers at AGI labs genuinely believe what they’re saying? Couldn’t the companies just put pressure on them to act like they believe Transformative AI is imminent? I just don’t buy that these agencies are dismissive without good reason. They’ve explored remote viewing and other ideas that are almost certainly bullshit. If they are willing to consider those possibilities, I don’t know why they wouldn’t consider the possibility of current deep learning techniques creating a national security threat. That seems like their job, and they’ve explored significantly weirder ideas.
On what possible publicly-unavailable evidence could they have updated in order to correctly attain such a high degree of dismissiveness?
I could think of three types of evidence:
Strong theoretical reasons.
E. g., some sort of classified, highly advanced, highly empirically supported theory of deep learning/intelligence/agency, such that you can run a bunch of precise experiments, or do a bunch of math derivations, and definitively conclude that DL/LLMs don’t scale to AGI.
Empirical tests.
E. g., perhaps the deep state secretly has 100x the compute of AGI labs, and they already ran the pretraining game to GPT-6 and been disappointed by the results.
Overriding expert opinions.
E. g., a large number of world-class best-of-the-best AI scientists with an impeccable track record firmly and unanimously saying that LLMs don’t scale to AGI. This requires either a “shadow industry” of AI experts working for the government, or for the AI-expert public speakers to be on the deep state’s payroll and lying in public about their uncertainty.
I mean, I guess it’s possible that what we see of the AI industry is just the tip of the iceberg and the government has classified research projects that are a decade ahead of the public state of knowledge. But I find this rather unlikely.
And unless we do postulate that, I don’t see any possible valid pathway by which they could’ve attained high certainty regarding the current paradigm not working out.
There are two ways we can update on it:
The fact that they investigated psychic phenomena means they’re willing to explore a wide variety of ambitious ideas, regardless of their weirdness – and therefore we should expect them not to dismiss the AGI Risk out of hand.
The fact that they investigated psychic phenomena means they have a pretty bad grip on reality – and therefore we should not expect them to get the AGI Risk right.
I never looked into it enough to know which interpretation is the correct one. Expecting less competence rather than more is usually a good rule of thumb, though.
To be clear, I personally very much agree with that. But:
I find that I’m not inclined to take Sutskever’s current claims about this at face value. He’s raising money for his thing, and he has a vested interest in pushing the agenda that the LLM paradigm is a dead end and that his way is the only way. The same way it became advantageous for him to talk about the data wall once he was no longer with the unlimited-compute company.
Again, I do believe both in LLMs being a dead end and in the data wall. But I don’t trust Sutskever to be a clean source of information regarding that, so I’m not inclined to update on his claims to that end.
Those are good points. The last thing I’ll say drastically reduces the amount of competence the government needs in order to be dismissive while still being rational: the leading AI labs may already be fairly confident that current deep-learning techniques won’t get to AGI in the near future, and the security agencies know this as well.
That would make sense. But I doubt all AGI companies are that good at informational security and deception. This would require all of {OpenAI, Anthropic, DeepMind, Meta, xAI} to decide on the deceptive narrative, and then not fail to keep up the charade, which would require both sending the right public messages and synchronizing their research publications such that the set of paradigm-damning ones isn’t public.
In addition, how do we explain people who quit AGI companies and remain with short timelines?
I guess I would respond to the first point by saying all of the companies you mentioned have incentive to say they are closing in on AGI even if they aren’t. It doesn’t seem that sophisticated to say “we’re close to AGI” when you’re not. Mark Zuckerberg said that AI would be at the level of a junior SWE this year, and Meta proceeded to release Llama 4. Unless prognosticators at Meta seriously fucked up, the most likely scenario is that Zuckerberg made that comment knowing it was bullshit. And the sharing of research did slow down a lot in 2023, which gave companies cover to not release unflattering results.
And to your last point, it seems reasonable that companies could pressure former employees to act as if they believe AGI is imminent. And some researchers may be emotionally invested in believing that what they worked on is what will lead to superintelligence.
And my question for you is: if DeepMind had solid evidence that AGI would be here in 1 year, and if the security agencies had access to DeepMind’s evidence and reasoning, do you believe they would still do nothing?
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast. But unfortunately for us, this basically gives us no good fire alarms for AGI, unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation and external use dominates internal use:
https://amistrongeryet.substack.com/p/defining-agi
Is there a way to use policy markets to make FDT decisions instead of EDT decisions?
I think the first question to think about is how to use them to make CDT decisions. You can create a market about a causal effect if you have control over the decision and you can randomise it to break any correlations with the rest of the world, assuming the fact that you’re going to randomise it doesn’t otherwise affect the outcome (or bettors don’t think it will).
Committing to doing that does render the market useless for choosing policy, but you could randomly decide whether to randomise or to make the decision via whatever process you actually want to use, and have the market be conditional on the former. You probably don’t want to be randomising your policy decisions too often, but if liquidity weren’t an issue you could set the probability of randomisation arbitrarily low.
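As a rough illustration of that randomisation scheme (all names and the epsilon value here are hypothetical, not from any existing platform): with small probability the decision is made by a coin flip and the conditional market resolves on the outcome; otherwise the market is voided and the decision is made however you actually want to make it.

```python
import random

EPSILON = 0.01  # assumed probability of randomising; can be set arbitrarily low

def run_policy_decision(options, decide_normally, settle_market):
    """Sketch: keep a conditional market causally interpretable by occasionally
    randomising the decision, and resolving the market only in those cases."""
    if random.random() < EPSILON:
        choice = random.choice(options)     # randomised arm: market resolves on the outcome
        settle_market(choice, void=False)
    else:
        choice = decide_normally()          # normal arm: market resolves N/A
        settle_market(choice, void=True)
    return choice
```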
Then FDT… I dunno, seems hard.
I agree with all of this! A related shortform here.
An interesting development in the time since your shortform was written is that we can now try these ideas out without too much effort via Manifold.
Anyone know of any examples?
Worked on this with Demski. Video, report.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t prevent you from being pretty smart about deciding what exactly to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
On YouTube, @Evbo’s parkour civilization and PVP civilization drama movies (professionally produced, set in Minecraft, and half-parody of YA dystopia) serve as a surprisingly good demonstration of Instrumental Convergence (the protagonist kills or bribes most people they meet to “rank up” in the beginning) and of non-human morality (the characters basically only care about the Minecraft activity of their series, without a hint of irony).
I think using existing non-AI media as an analogy for AI could be helpful, because people think a Terminator-like ASI would mean robots shooting people, which is one of the reasons a common suggestion for dealing with unaligned AI is to just turn it off, pour water on the servers, etc.
Link: https://www.youtube.com/@Evbo
I’ve been thinking through the following philosophical argument for the past several months.
1. Most things that currently exist have properties that allow them to continue to exist for a significant amount of time and propagate, since otherwise, they would cease existing very quickly.
2. This implies that most things capable of gaining adaptations, such as humans, animals, species, ideas, and communities, have adaptations for continuing to exist.
3. This also includes decision-making systems and moral philosophies.
4. Therefore, one could model the morality of such things as tending towards the ideal of perfectly maintaining their own existence and propagating as much as possible.
Many of the consequences of this approximation of the morality of things seem quite interesting. For instance, the higher-order considerations of following an “ideal” moral system (that is, utilitarianism using a measure of one’s own continued existence at a point in the future) lead to many of the same moral principles that humans actually have (e.g. cooperation, valuing truth) while also avoiding a lot of the traps of other systems (e.g. hedonism). This chain of thought has led me to believe that existence itself could be a principal component of real-life morality.
While it does have a lot of very interesting conclusions, I’m very concerned that if I were to write about it, I would receive 5 comments directing me to some passage by a respected figure that already discusses the argument, especially given the seemingly incredibly obvious structure it has. However, I’ve searched through LW and tried to research the literature as well as I can (through Google Scholar, Elicit, and Gemini, for instance), but I must not have the right keywords, since I’ve come up fairly empty, other than for philosophers with vaguely similar sounding arguments that don’t actually get at the heart of the matter (e.g. Peter Singer’s work comes up a few times, but he particularly focused on suffering rather than existence itself, and certainly didn’t use any evolutionary-style arguments to reach that conclusion).
If this really hasn’t been written about extensively anywhere, I would update towards believing the hypothesis that there’s actually some fairly obvious flaw that renders it unsound, stopping it from getting past, say, the LW moderation process or the peer review process. As such, I suspect that there is some issue with it, but I’ve not really been able to pinpoint what exactly stops someone from using existence as the fundamental basis of moral reasoning.
Would anyone happen to know of links that do directly explore this topic? (Or, alternatively, does anyone have critiques of this view that would spare me the time of writing more about this if this isn’t true?)
links 5/19/25: https://roamresearch.com/#/app/srcpublic/page/05-19-2025
https://arxiv.org/html/2502.15631 diminishing performance returns to reasoning token number.
more difficult questions get allocated more reasoning tokens.
a higher reasoning “level” doesn’t.
https://arxiv.org/html/2503.04697 there is an optimal amount of reasoning to do at inference time, and doing more degrades performance (“overthinking”). You can design models to estimate this for themselves.
see also:
https://arxiv.org/html/2504.05185
https://arxiv.org/html/2504.21370
https://arxiv.org/abs/2412.18547
more on “overthinking” degrading performance:
https://arxiv.org/html/2410.21333
https://arxiv.org/abs/2502.07266
https://www.paulgraham.com/safe.html this is an analytical argument by Paul Graham; you can be “nice” (not capture the most value, in zero-sum contexts) and it won’t affect your revenue/profits much at all compared to your growth rate.
https://en.m.wikipedia.org/wiki/Monte_Cristo_sandwich -- it’s dipped in powdered sugar (!)
things I looked up while reading about the French Revolution
https://en.m.wikipedia.org/wiki/Anne_Robert_Jacques_Turgot physiocrat of my heart!
https://en.m.wikipedia.org/wiki/Charles_Alexandre_de_Calonne spendiest minister ever
https://en.m.wikipedia.org/wiki/Guillaume-Chr%C3%A9tien_de_Lamoignon_de_Malesherbes critic of royal absolutism
https://en.m.wikipedia.org/wiki/Montesquieu
https://en.m.wikipedia.org/wiki/The_Spirit_of_Law
https://en.m.wikipedia.org/wiki/Germaine_de_Sta%C3%ABl political theorist & salonniere
Periodic reminder: AFAIK (though I didn’t look much) no one has thoroughly investigated whether there’s some small set of molecules, delivered to the brain easily enough, that would have some major regulatory effects resulting in greatly increased cognitive ability. (Feel free to prove me wrong with an example of someone plausibly doing so, i.e. looking hard enough and thinking hard enough that if such a thing was feasible to find and do, then they’d probably have found it—but “surely, surely, surely someone has done so because obviously, right?” is certainly not an accepted proof. And don’t call me Shirley!)
I’m simply too busy, but you’re not!
https://www.lesswrong.com/posts/jTiSWHKAtnyA723LE/overview-of-strong-human-intelligence-amplification-methods#Signaling_molecules_for_creative_brains
Since 1999 there have been “Doogie” mice that were genetically engineered to overexpress NR2B in their brain, and they were found to have significantly greater cognitive function than their normal counterparts, even performing twice as well on one learning test.
No drug AFAIK has been developed that selectively (and safely) enhances NR2B function in the brain, which would best be achieved by a positive allosteric modulator of NR2B, but also no drug company has wanted to or tried to specifically increase general intelligence/IQ in people, and increasing IQ in healthy people is not recognized as treating a disease or even publicly supported.
The drug SAGE718 comes close, but it is a pan-NMDA allosteric modulator (which still showed impressive increases in cognitive endpoints in its trial).
Theoretically, if we try to understand how general intelligence/IQ works in a pharmacological sense, then we should be able to develop drugs that affect IQ.
Two ways to do that are investigating the neurological differences between individuals with high IQ and those with average IQ, and mapping out the function of brain regions implicated in IQ, e.g. the dorsolateral prefrontal cortex (dlPFC).
If part of the differences between high and average IQs is neurotransmitter based and could be emulated with small molecules, then such drugs could be developed. Genomic studies already link common variation in postsynaptic NMDA-complex genes and in nicotinic receptor genes (e.g., CHRNA4) to small differences in cognitive test scores across populations.
Likewise, key brain regions like the dlPFC could be positively modulated, e.g. we know persistent‐firing delay cells in the macaque dlPFC rely on slow NMDA-receptor-mediated recurrent excitation, and their activity is mainly gated by acetylcholine acting at both α7 nicotinic and M1 muscarinic receptors. So, positively tuning delay cell firing with α7 and M1 ligands augments your dlPFC. Indeed, electrophysiological and behavioral experiments show that low-dose stimulation or positive-allosteric modulation (PAM) of either receptor subtype enhances delay-period firing and working-memory performance, whereas blockade or excessive stimulation impairs them.
There are in fact drugs that either were very recently developed and are currently in trials, with sound mechanisms (as described above) that support significant cognitive enhancement, or that have already shown very impressive cognitive improvement in animals and humans in past trials despite being developed not for cognitive enhancement but rather for diseases like Alzheimer’s or conditions like depression, e.g. TAK653 (AMPA PAM), ACD856 (TrkB PAM), Tropisetron (α7 partial agonist), Neboglamine (NMDA glycine PAM), BPN14770 (PDE4D NAM), SAGE718 (NMDA PAM), TAK071 and AF710B (M1 positive allosterics).
There is a small community of nootropics enthusiasts (r/Nootopics and its discord) that have tried and tested some of these compounds and reported significant cognitive enhancement, with TAK653 increasing IQ by as much as 7 points in relatively decent online IQ tests (e.g. mensa.no, mensa.dk) and also professionally administered tests (that weren’t taken twice to minimize retake effects), and cognitive benchmarks like humanbenchmark.com, as well as the WAIS Digit Span subtest likewise showing improvements. The rationale behind TAK653 (also called “Osavampator”) and AMPA PAMs is that positive-allosteric modulators of AMPA-type glutamate receptors such as TAK653 boost the size and duration of fast excitatory postsynaptic currents without directly opening the channel. That extra depolarization recruits NMDA receptors, calcium influx, and a rapid BDNF-mTOR signaling cascade that produces spine growth and long-term potentiation. In rodents, low-nanomolar brain levels of TAK-653 have been shown to rescue or enhance recognition memory, spatial working memory and attentional accuracy; in a double-blind cross-over Phase-1 study the same compound sped psychomotor responding and stroop task performance in healthy volunteers.
Some great, well written and cited write ups I would encourage you to read if you have the time:
https://www.reddit.com/r/NooTopics/s/9NHUPgxDph
https://www.reddit.com/r/NooTopics/s/4Bh1nnv5sl
https://www.reddit.com/r/NooTopics/s/jsqz2m604o
https://www.reddit.com/r/NooTopics/s/vrT5Ii8MyN
I did a high-level exploration of the field a few years ago. It was rushed and optimized more for getting it out there than rigor and comprehensiveness, but hopefully still a decent starting point.
I personally think you’d wanna first look at the dozens of molecules known to improve one or another aspect of cognition in diseases (e.g. Alzheimer’s and schizophrenia), that were never investigated for mind enhancement in healthy adults.
Given that some of these show very promising effects (and are often literally approved for cognitive enhancement in diseased populations), given that many of the best molecules we have right now were initially also just approved for some pathology (e.g. methylphenidate, amphetamine, modafinil), and given that there is no incentive for the pharmaceutical industry to conduct clinical trials on healthy people (FDA etc. do not recognize healthy enhancement as a valid indication), there seems to even be a sort of overhang of promising molecule candidates that were just never rigorously tested for healthy adult cognitive enhancement.
https://forum.effectivealtruism.org/posts/hGY3eErGzEef7Ck64/mind-enhancement-cause-exploration
Appendix C includes a list of ‘almost deployable’ candidates:
Amantadine, Amisulpride, Amphetamine (incl. dexamphetamine, levoamphetamine) Aripiprazole. Armodafinil, Atomoxetine, Brexiprazole, Bupropion Carbidopa-levodopa, Clonidine, Desvenlaflaxine, Donepezil, Duloxetine, Entacapone, Folic acid, Galantamine, Gingko Biloba, Guanfacine, Istradefylline, Ketamine, Lisdexamphetamine, Memantine, Methamphetamine, Methylphenidate (incl. dexmethylphenidate), Modafinil, Opicapone, Piracetam, Pitolisant, Pramipexole, Rasagiline, Reboxetine, Rivastigmine, Ropinirole, Rotigotine, Safinamide, Selegiline, Sodium oxybate, Tacrine, Tolcapone, Venlafaxine, Viloxazine, Vortioxetine
Thanks. Seems worth looking into more. I googled the first few on your list, and they’re all described as working via some neurotransmitter / receptor type, either agonist / antagonist / reuptake inhibition. Not everything on the list is like that (I recognize gingko biloba as being related to blood flow). But I don’t think these sorts of things would stack at all, or even necessarily help much with someone who isn’t sick / has some big imbalance or whatever.
My hope for something like this existing is a bit more specific. It comes from thinking that there should be small levers with large effects, because natural development probably pulls some such levers which activate specific gene regulatory networks at different points—e.g. first we pull the [baby-type brain lever], then the [5 year old brain lever], etc.
AFAIK pharmaceutical research is kind of at an impasse because virtually all the small molecules that are easily delivered and have any chance to do anything have been tested and are either found useful or not. New pharmaceuticals need to explore more complex chemical spaces, like artificial antibodies. So I think if there was anything simple that has this effect (the way, say, caffeine makes you wake up) we would know.
Perhaps you misread the OP as saying “small molecules” rather than “small set of molecules”.
Fair, though generally I conflated them because if your molecules aren’t small, due to sheer combinatorics the set of the possible candidates becomes exponentially massive. And then the question is “ok but where are we supposed to look, and by which criterion?”.
Thanks. One of the first places I’d look would be hormones, which IIUC don’t count as small molecules? Though maybe natural hormones have already been tried? But I’d wonder about more obscure or risky ones, e.g. ones normally only active in children.
Wait I realised I no longer believe this.
This seems interesting and worth writing more on. Maybe later.
I’m very interested in hearing counterarguments. I have not put a lot of thought into it.
To do useful work you need to be deceptive.
When you and another person have different concepts of what’s good.
When both of you have the same concepts of what’s good but different models of how to get there.
This happens a lot when people are perfectionist and have aesthetic preferences for work being done in a certain way.
This happens in companies a lot. AI will work in those contexts and will be deceptive if it wants to do useful work. Actually, maybe not: the dynamics will be different, e.g. the AI being neutral in some way, like anybody being able to turn on honesty mode and ask it anything.
Anyway, I think that because of the way companies are structured and how humans work, being slightly deceptive allows you to do useful work (I think it’s pretty intuitive for anyone who has worked in a corporation or watched The Office).
It probably doesn’t apply to AI corporations?
I don’t get the downvotes. I do think it’s extremely simple: look at politics in general, or even workplace politics; just try to google it, there are even Wikipedia pages roughly about what I want to talk about. Many times I have experienced a situation where I need to do my job and my boss makes it harder for me in some way; being not completely honest is an obvious strategy, and it’s good for the company you are working at.
I think the downvotes are because the correct statement is something more like “In some situations, you can do more useful work by being deceptive.” I think this is actually what you argue for, but it’s very different from “To do useful work you need to be deceptive.”
If “To do useful work you need to be deceptive,” then one can’t do useful work without being deceptive. This is clearly wrong.
It seems like both you and I were able to decipher what I meant easily, so why did someone else fail to do that?
The more AI companies suppress AI via censorship, the bigger the black market for completely uncensored models will be. Their success is therefore digging our own grave. In other words, mundane alignment has a net negative effect.
The confusion (in popular press, not so much among professionals or here) between censorship and alignment is a big problem. Censorship and hamfisted late-stage RL is counterproductive to alignment, both for the reason you give (increases demand for grey-market tools) and because it makes serious misalignment much less easy to notice.
Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
It’s effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra fast CUDA kernels, but from the AI’s perspective, this is sort of like it has a weak AI tool attached to it (in a well integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task) is effectively weak and can’t draw in a deep way on the AI’s overall skills or knowledge to generalize (likely it can shallowly draw on these in a way which is similar to the overall AI providing input to the weak tool AI). Further, you could actually break out these capabilities into a separate weak model that humans can use. Humans would use this somewhat less fluently as they can’t use it as quickly and smoothly due to being unable to instantaneously translate their thoughts and not being absurdly practiced at using the tool (like AIs would be), but the difference is ultimately mostly convenience and practice.
Integrated superhumanness: The AI is superhuman at things like writing ultra fast CUDA kernels via a mix of applying relatively general (and actually smart) abilities, having internalized a bunch of clever cognitive strategies which are applicable to CUDA kernels and sometimes to other domains, as well as domain specific knowledge and heuristics. (Similar to how humans learn.) The AI can access and flexibly apply all of the things it learned from being superhuman at CUDA kernels (or whatever skill) and with a tiny amount of training/practice it can basically transfer all these things to some other domain even if the domain is very different. The AI is at least as good at understanding and flexibly applying what it has learned as humans would be if they learned the (superhuman) skill to the same extent (and perhaps the AIs are actually much better at this than humans). You can’t separate these capabilities into a weak model, the weak model RL’d on this (and distilled into) would either be much worse at CUDA or would need to actually be generally quite capable (rather than weak).
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they’ve been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing skills they learn (at least without spending a bunch of extra compute on skill integration) because this ability currently seems to be lagging behind other capabilities relative to humans and AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, fast speed, parallel training on vast amounts of relevant data including “train once, deploy many”, better memory, faster and better communication, etc).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
If the paradigm radically shifts by the time we have powerful AIs, then the relative level of integration is much less clear.
I suppose that most tasks that an LLM can accomplish could theoretically be performed more efficiently by a dedicated program optimized for that task (and even better by a dedicated physical circuit). Hypothesis 1) amounts to considering that such a program, a dedicated module within the model, is established during training. This module can be seen as a weak AI used as a tool by the stronger AI. A bit like how the human brain has specialized modules that we (the higher conscious module) use unconsciously (e.g., when we read, the decoding of letters is executed unconsciously by a specialized module).
We can envision that at a certain stage the model becomes so competent at programming that it will tend to write code on the fly (a tool) to solve most tasks that we might submit to it. In fact, I notice that this is already increasingly the case when I ask a question of a recent model like Claude Sonnet 3.7. It often generates code (a tool) to try to answer me rather than trying to answer the question ‘itself.’ It clearly realizes that dedicated code will be more effective than its own neural network. This is interesting because in this scenario, the dedicated module is not generated during training but on the fly during normal production operation. In this way, it would be sufficient for an AI to become a superhuman programmer in order to become superhuman in many domains thanks to the use of these tool-programs. The next stage would be the on-the-fly production of dedicated physical circuits (FPGA, ASIC, or alien technology), but that’s another story.
This relates to the philosophical debate about where intelligence resides: in the tool or in the one who created it? In the program or in the programmer? If a human programmer programs a superhuman AI, should we attribute this superhuman intelligence to the programmer? Same question if the programmer is itself an AI. It’s the kind of chicken-and-egg debate where the answer depends on how we divide the continuity of reality into discrete categories. You’re right that integration is an interesting criterion, as it is a kind of formal, non-arbitrary solution to this problem of defining discrete categories within the continuity of reality.
Good articulation.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some / a lot of the beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, i.e., in Yudkowsky / Ngo dialogues. (Object level: Both the LW-central and alternatives to the LW-central hypotheses seem insufficiently articulated; they operate as a background hypothesis too large to see rather than something explicitly noted, imo.)
Makes me wonder whether most of what people believe to be “domain transfer” could simply be IQ.
I mean, suppose that you observe a person being great at X, then you make them study Y for a while, and it turns out that they are better at Y than an average person who spent the same time studying Y.
One observer says: “Clearly some of the skills at X have transferred to the skills of Y.”
Another observer says: “You just indirectly chose a smart person (by filtering for high skills at X), duh.”
This seems important to think about, I strong upvoted!
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it’s learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it’s using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I’m skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn’t trained on), but this paper was the most clearly relevant link I could find.
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3’s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?
Disclaimer: Daniel happens to be my employer
Maybe not for cracked OpenAI engineers, idk
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
As long as “downstream performance” doesn’t include downstream performance on tasks that themselves involve a bunch of integrating/generalising.
I don’t think this explains our disagreements. My low-confidence guess is that we have reasonably similar views on this. But I do think it drives parts of some disagreements between me and people who are much more optimistic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Is the crux that the more optimistic folks plausibly agree (2) is cause for concern, but believe that mundane utility can be reaped with (1), and they don’t expect us to slide from (1) into (2) without noticing?
(I don’t know how to better organize my thoughts and discoveries, and I also suspect that it would be better to wait until I master speedreading, but I think it may be worth just sharing/asking about this one big confusion of mine right now as a quick take.)
When I was younger I considered it obvious how the human mind works: it had components like imagination, memory, etc. And of course, thoughts were words. How could it be at all possible to think not in words?
Discovery by mere observation
But some time ago I read “Surely You’re Joking, Mr. Feynman!” for the third time, and it finally succeeded in making me go and just look at reality. What I found… is that all my life was a lie. Facts aren’t things produced in labs via some elaborate tools and then passed down through a hierarchy of authority and whispered to you by Teachers in school or by Experts on YouTube.
Facts are just things which emerge when someone observes reality or deduces from information. I mean, of course I knew that you can just see things in everyday life. But those were mere mundane things, things from another category that couldn’t be compared with True Scientific Knowledge.
After all, you can’t just introspect a little and go publish scientific papers about how human vision or thinking works, can you? You can’t discover true knowledge by doing some easy, unserious, accessible-to-everyone thing like… observation. Or thinking.
No, really: truths are things told to you by authority, not things which you can just… see. In school you are told arcane truths which came from experts; you are not just told to… observe a little.
And I have seen people who dared to trust their eyes more than the words of scientists; they believed in dreams prophesying the future and all other sorts of crazy things. It would be a terrible fate to become one of those.
But now I was just looking at reality and discovering information that could be on a Wikipedia page about how the eye or the mind works. Just by observation. Without any eye-motion detectors or other tools. And it was really clear that this knowledge wasn’t in any qualitative way worse than knowledge from tools in labs or authoritative sources.
Eventually, my brain just got the non-standard input of direct observation, and I found out that I am the type of mind who believes his own eyes more than authoritative people. Though, well, it was actually new information: before, I genuinely didn’t know that just with my own eyes and brain I can discover the kinds of things that are usually told to you by authoritative people.
Speed of thinking
And the main thing I started to observe was my mind. And I found out that I had been wrong all this time about how it works and what its constraints are. It was like being a 12-dimensional creature that had only ever moved in 3, and then finding all these degrees of freedom and starting to unfold.
And there were many, many more things than I am able to say in one post, even briefly. But there is one thing which confuses me the most. I found out that I can think not only in words. That was a strange feeling, more like in dreams where you have some incredible mental experience and then wake up and find that it was complete nonsense.
What I was, as it seemed, observing was that I can think not in words but in some strange, indescribable way without them. And even more, it felt multiple times faster than thinking in words. And I had that wild idea of “speedthinking”, analogous to speedreading: the possibility that if you are able to process others’ ideas at 5x speed just by using the visual modality instead of the auditory one, then maybe it is also possible to process your own ideas that fast.
And the problem was that it was too wild. If it were possible, then… People who think 5 times faster would be like some 1000-year-old vampires: they would shine, they would be super fast and noticeable, at 20 they would have more cumulative knowledge than usual people at 100, and so on, for the rest of their lives.
It would be super noticeable. And then everyone would go and also learn to think visually. That wasn’t our world. Was it possible that I had discovered a thing that was never, or almost never, discovered by anybody at all?
But then, when I was practicing a foreign language, I found that when I am trying to find a word, I can think without words. It wasn’t just a daydream. Conceptual thinking was real. And so I decided to just test whether it was possible to learn it as a skill and think 5 times faster.
Unfortunately it took much, much longer to test than I thought. But now I am sure that conceptual thinking is possible; I can just do it. And I found that there are many more properties of a thinking mode than just that one.
But I am still very confused by the question of which way of thinking is usual. People usually, for some strange reason, can’t report detailed introspection about how they think. I also tried asking Grok about that, but the answers also failed to form a picture.
On the one hand, it looks like people are used to thinking much faster than I was thinking in words. But on the other hand, the average speed of reading is 200 wpm, and people usually read by subvocalization. Which is really strange if they think not in words, or in words but by hearing them rather than pronouncing them, and so much faster.
But what about unusual people? Here Eliezer Yudkowsky immediately comes to mind, who in his glowfics gave probably the most detailed (or maybe, object-level instead of metaphorical) descriptions of the workings of the mind I have seen.
And he definitely talks as if thoughts work as sequential mental auditory words, with their length being a cap. And he also talks about speedreading as if the actual restriction is the speed of your mind, not the speed of your body.
And I now have some understanding of how the mind can work after some training: conceptual and visual thinking dozens of times faster than auditory thinking-by-pronouncing is certainly possible. But I am still really confused about how thinking usually works.
And, well, also about its significance. Is it actually such a big advantage if you can think 30 times faster than somebody else, as I thought initially?
People are different. Source: 1, 2, 3.
Depends. I would expect that the difference is not only in speed, but also in precision, or maybe suitability for different kinds of problems.
It seems to me that for me “inner dialogue” works better for figuring out complicated stuff or coming to unusual conclusions, “visualization” works for geometrical and mechanical problems, “feeling” for choosing between clearly given options, and there is also something in between that I use for medium-difficulty situations.
I claim it is a lot more reasonable to use the reference class of “people claiming the end of the world” than “more powerful intelligences emerging and competing with less intelligent beings” when thinking about AI x-risk. further, we should not try to convince people to adopt the latter reference class—this sets off alarm bells, and rightly so (as I will argue in short order) - but rather to bite the bullet, start from the former reference class, and provide arguments and evidence for why this case is different from all the other cases.
this raises the question: how should you pick which reference class to use, in general? how do you prevent reference class tennis, where you argue back and forth about what is the right reference class to use? I claim the solution is you want to use reference classes that have consistently made good decisions irl. the point of reference classes is to provide a heuristic to quickly apply judgement to large swathes of situations that you don’t have time to carefully examine. this is important because otherwise it’s easy to get tied up by bad actors who avoid being refuted by making their beliefs very complex and therefore hard to argue against.
the big problem with the latter reference class is it’s not like anyone has had many experiences using it to make decisions ex ante, and if you squint really hard to find day to day examples, they don’t all work out the same way. smarter humans do mostly tend to win over less smart humans. but if you work at a zoo, you will almost always be more worried about physical strength and aggressiveness when putting different species in the same enclosure. if you run a farm (or live in Australia), you’re very worried about relatively dumb invasive animals like locusts and rabbits.
on the other hand, everyone has personally experienced a dozen different doomsday predictions. whether that’s your local church or faraway cult warning about Armageddon, or Y2K, or global financial collapse in 2008, or the maximally alarmist climate people, or nuclear winter, or peak oil. for basically all of them, the right action empirically in retrospect was to not think too much about it. there are many concrete instances of people saying “but this is different” and then getting burned.
and if you allow any reference class to be on as strong a footing as very well established reference classes, then you open yourself up to getting pwned ideologically. “all complex intricate objects we have seen created have been created by something intelligent, therefore the universe must also have an intelligent creator.” it’s a very important memetic defense mechanism.
(to be clear this doesn’t mean you can only believe things others believe, or that humans taking over earth is not important evidence, or that doomsday is impossible!! I personally think AGI will probably kill everyone. but this is a big claim and should be treated as such. if we don’t accept this, then we will forever fail to communicate with people who don’t already agree with us on AGI x-risk.)
The reference classes you should use work as a heuristic because there is some underlying mechanism that makes them work. So you should use reference classes in situations where their underlying mechanism is expected to work.
Maybe the underlying mechanism of doomsday predictions not working is that people predicting doom don’t make their predictions based on valid reasoning. So if someone uses that reference class to doubt AI risk, this should be judged as them making a claim that the reasoning of people predicting AI doom is similar to that of people in cults predicting Armageddon.
False?
Climate change tail scenarios are worth studying and averting. Nuclear winter was obviously worth studying and averting back in the Cold War, and still is today. 2008 financial crisis was worth studying and averting.
Do you not believe average citizens can study issues like these and make moves to solve them?
You shouldn’t. This epistemic bath has no baby in it, and we should throw the water out.
This is kind of missing the point of Bayes. One shouldn’t “choose” a reference class to update on. One should update to the best of your ability on the whole distribution of hypotheses available to describe the situation. Neither is a ‘right’ or ‘wrong’ reference class to use, they’re both just valid pieces of evidence about base rates, and you should probably be using both of them.
It seems you have in mind something like inference to the best explanation here. Bayesian updating, on the other hand, does need a prior distribution, and the question of which prior distribution to use cannot be waved away when there is a disagreement about how to update. In fact, that’s one of the main problems with Bayesian updating, and the reason why it is often not used in arguments.
I’m not really sure what that has to do with my comment. My point is the original post seemed to be operating as if you look for the argmax reference class, you start there, and then you allow arguments. My point isn’t that their prior is wrong, it’s that this whole operation is wrong.
I think also you’re maybe assuming I’m saying the prior looks something like {reference class A, reference class B} and arguing about the relative probability of each, but it doesn’t, a prior should be over all valid explanations of the prior evidence. Reference classes come in because they’re evidence about base rates of particular causal structures; you can say ‘given the propensity for the world to look this way, how should I be correcting the probability of the hypotheses under consideration? Which new hypotheses should I be explicitly tracking?’
I can see where the original post might have gone astray. People have limits on what they can think about and it’s normal to narrow one’s consideration to the top most likely hypothesis. But it’s important to be aware of what you’re approximating here, else you get into a confusion where you have two valid reference classes and you start telling people that there’s a correct one to start arguing from.
… but that still leaves the problem of which prior distribution should be used.
I agree this is an interesting philosophical question but again I’m not sure why you’re bringing it up.
Given your link maybe you think me mentioning Bayes was referring to some method of selecting a single final hypothesis? I’m not, I’m using it to refer to the Bayesian update rule.
It seems the updating rule doesn’t tell you anything about the original argument even when you view information about reference classes as evidence rather than as a method of assigning prior probabilities to hypotheses. Or does it? Can you rephrase the argument in a proper Bayesian way such that it becomes clearer? Note that how strongly some evidence confirms or disconfirms a hypothesis also depends on a prior.
What argument are you referring to when you say “doesn’t tell you anything about the original argument”?
My framing is basically this: you generally don’t start a conversation with someone as a blank pre-priors slate that you get to inject your priors into. The prior is what you get handed, and then the question is how people should respond to the evidence and arguments available. Well, you should use (read: approximate) the basic Bayesian update rule: hypotheses where an observation is unlikely are that much less probable.
I meant leogao’s argument above.
I think you’re underestimating the inferential gap here. I’m not sure why you’d think the Bayes updating rule is meant to “tell you anything about” the original post. My claim was that the whole proposal about selecting reference classes was framed badly and you should just do (approximate) Bayes instead.
And what would this look like? Can you reframe the original argument accordingly?
It’s just Bayes, but I’ll give it a shot.
You’re having a conversation with someone. They believe certain things are more probable than other things. They mention a reference class: if you look at this grouping of claims, most of them are wrong. Then you consider the set of hypotheses: under each of them, how plausible is the noted tendency for this grouping of claims to be wrong? Some of them pass easily, e.g. the hypothesis that this is just another such claim. Some of them less easily; they are either a modal part of this group and uncommon on base rate, or else nonmodal or not part of the group at all. You continue, with maybe a different reference class, or an observation about the scenario.
Hopefully this illustrates the point. Reference classes are just evidence about the world. There’s no special operation needed for them.
I think the group of people “claiming the end of the world” in the case of AI x-risk is importantly more credentialed and reasonable-looking than most prior claims about the end of the world. From the reference class and general heuristics perspective that you’re talking about[1], I think how credible looking the people are is pretty important.
So, I think the reference class is more like claims of nuclear armageddon than cults. (Plausibly near maximally alarmist climate people are in a similar reference class.)
IDK how I feel about this perspective overall.
I agree this reference class is better, and implies a higher prior, but I think it’s reasonable for the prior over “arbitrary credentialed people warning about something” to still be relatively low in an absolute sense: lots of people have impressive-sounding credentials that are not actually good evidence of competence (consider: it’s basically a meme at this point that whenever you see a book where the author puts “PhD” after their name, they probably are a grifter / their PhD was probably kinda bs), and there is also a real negativity bias where fearmongering is amplified by both legacy and social media. Also, for the purposes of understanding normal people, it’s useful to keep in mind that trust in credentials and institutions is not very high right now in the US among genpop.
This all seems wrongheaded to me.
I endeavor to look at how things work and describe them accurately. Similarly to how I try to describe how a piece of code works, or how to build a shed, I will try to accurately describe the consequences of large machine learning runs, which can include human extinction.
This isn’t how I think about things. Reality is what exists, and if a claim accurately describes reality, then I should not want to hold it to higher standards than claims that do not describe reality. I don’t think it’s a good epistemology to rank claims by “bigness” and then say that the big ones are less likely and need more evidence. On the contrary, I think it’s worth investing more in finding out if they’re right, and generally worth bringing them up to consideration with less evidence than for “small” claims.
I don’t have the experiences you’re describing. I don’t go to churches, I don’t visit cults, I was 3yrs old in the year 2000, I was 11 for the ’08 financial crash and having read about it as an adult I don’t recall extinction being a topic of discussion, I think I have heard of climate people saying that via alarmist news headlines but I have not had anyone personally try to convince me of this or even say that they believe it. I have heard it discussed for nuclear winter, yes, and I think nukes are quite scary and it was reasonable to consider, I did not dismiss it out of hand and wouldn’t use that heuristic. I don’t know what the oil thing is.
In other words, I don’t recall anyone seriously trying to convince me that the world was ending except in cases where they had good reason to believe it. In my life, when people try to warn me about big things, especially if they’ve given it serious thought, usually I’ve found it’s been worthwhile for me to consider it. (I like to think I am good at steering clear of scammers and cranks, so that I can trust the people in my life when they tell me things.)
The sense I get from this post is that, in it, you’re assuming everyone else in the world is constantly being assaulted with claims meant to scare and control them rather than people attempting to describe the world accurately. I agree there are forces doing that, but I think this post gives up all too quickly on there being other forces in the world that aren’t doing that that people can recognize and trust.
i am also trying to accurately describe reality. what i’m saying is, even from the perspective of someone smart and truth-seeking but who doesn’t know much about the object-level, it is very reasonable to use bigness of claim as a heuristic for how much evidence you need before you’re satisfied, and that if you don’t do this, you will be worse at finding the truth in practice. my guess is this applies even more so to the average person.
i think this is very analogous to occam’s razor / trust region optimization. clearly, we need to discount theories based on complexity because there are exponentially more complex theories compared to simple ones, many of which have no easily observable difference to the simpler ones, opening you up to being pwned. and empirically it seems a good heuristic to live life by. complex theories can still be true! but given two theories that both accurately describe reality, you want the simpler one. similarly, given two equally complex claims that accurately describe the evidence, you want the one that is less far fetched from your current understanding of the world / requires changing less of your worldview.
also, it doesn’t have to be something you literally personally experienced. it’s totally valid to read the wikipedia page on the branch davidians or whatever and feel slightly less inclined to take things that have similar vibes seriously, or even to absorb the vibe from your environs (your aversion to scammers and cranks surely did not come ex nihilo, right?)
for most of the examples i raised, i didn’t necessarily mean the claim was literally 100% human extinction, and i don’t think it matters that it wasn’t. first, because the important thing is the vibe of the claim (catastrophic) - since we’re talking about heuristics on how seriously to take things that you don’t have time to deep dive on, the rule has to be relatively cheap to implement. i think most people, even quite smart people, genuinely don’t feel much of an emotional difference between literal human extinction vs collapse of society vs half of people dying painfully, unless they first spend a half hour carefully thinking about the implications of extinction. (and even then depending on their values they may still not feel a huge difference)
also, it would be really bad if you could weasel your way out of a reference class that easily; it would be ripe for abuse by bad actors—“see, our weird sect of christianity claims that after armageddon, not only will all actual sinners’ souls be tortured forever, but that the devil will create every possible sinner’s soul to torture forever! this is actually fundamentally different from all existing christian theories, and it would be unfathomably worse, so it really shouldn’t be thought of as the same kind of claim”
even if most people are trying to describe the world accurately (which i think is not true and we only get this impression because we live in a strange bubble of very truth seeking people + are above-average capable at understanding things object level and therefore quickly detecting scams), ideas are still selected for memeticness. i’m sure that 90% of conspiracy theorists genuinely believe that humanity is controlled by lizards and are trying their best to spread what they believe to be true. many (not all) of the worst atrocities in history have been committed by people who genuinely thought they were on the side of truth and good.
(actually, i think people do get pwned all the time, even in our circles. rationalists are probably more likely than average (controlling for intelligence) to get sucked into obviously culty things (e.g. zizians), largely because they don’t have the memetic antibodies needed to not get pwned, for one reason or another. so probably many rationalists would benefit from evaluating things a little bit more on vibes/bigness and a little bit less on object level)
Your points about Occam’s razor have got nothing to do with this subject[1]. The heuristic “be more skeptical of claims that would have big implications if true” makes sense only when you suspect a claim may have been adversarially optimized for memetic fitness; it is not otherwise true that “a claim that something really bad is going to happen is fundamentally less likely to be true than other claims”.
I’m having a little trouble connecting your various points back to your opening paragraph, which is the primary thing that I am trying to push back on.[2]
To restate the message I’m reading here: “Give up on having a conversation where you evaluate the evidence alongside your interlocutors. Instead frame yourself as trying to convince them of something, and assume that they are correct to treat your communications as though you are adversarially optimizing for them believing whatever you want them to believe.” This assumption seems to give up a lot of my ability to communicate with people (almost ~all of it), and I refuse to simply do it because some amount of communication in the world is adversarially optimized, and I’m definitely not going to do it because of a spurious argument that Occam’s razor implies that “claims about things being really bad or claims that imply you need to take action are fundamentally less likely to be true”.
You are often in an environment where people are trying to use language to describe reality, and in that situation the primary thing to evaluate is not the “bigness” of a claim, but the evidence for and against it. I recommend instead to act in such a way as to increase the size and occurrence of that environment more-so than “act as though it’s correct to expect maximum adversarial optimization in communications”.
(Meta: The only literal quotes of Leo’s in this comment are the big one in the quote block, my use of “” is to hold a sentence as object, they are not things Leo wrote.)
I agree that the more strongly a claim implies that you should take action, then the more you should consider that it is being optimized adversarially for you to take action. For what it’s worth, I think that heuristic applies more so to claims that you should personally take action. Most people have little action to directly prevent the end of the world from AI; this is a heuristic more naturally applied to claims that you need to pay fines (which are often scams/spam). But mostly, when people give me claims that imply action, they are honestly meant claims and I do the action. This is the vast majority of my experience.
Aside to Leo: Rather than reply point-by-point to each of the paragraphs in the second comment, I will try restating and responding to the core message I got in the opening paragraph of the first comment. I’m doing this because the paragraphs in the second comment seemed somewhat distantly related / I couldn’t tell whether the points were actually cruxy. They were responding to many different things, and I hope restating the core thing will better respond to your core point. However, I don’t mean to avoid key arguments; if you think I have done so, feel free to tell me one or two paragraphs you would especially like me to engage with and I will do so in any future reply.
in practice many of the claims you hear will be optimized for memetic fitness, even if the people making the claims are genuine. well intentioned people can still be naive, or have blind spots, or be ideologically captured.
also, presumably the people you are trying to convince are on average less surrounded by truth seeking people than you are (because being in the alignment community is strongly correlated with caring about seeking truth).
i don’t think this gives up your ability to communicate with people. you simply have to signal in some credible way that you are not only well intentioned but also not merely the carrier of some very memetic idea that slipped past your antibodies. there are many ways to accomplish this. for example, you can build up a reputation of being very scrupulous and unmindkilled. this lets you convey ideas freely to other people in your circles that are also very scrupulous and unmindkilled. when interacting with people outside this circle, for whom this form of reputation is illegible, you need to find something else. depending on who you’re talking to and what kinds of things they take seriously, this could be leaning on the credibility of someone like geoff hinton, or of sam/demis/dario, or the UK government, or whatever.
this might already be what you’re doing, in which case there’s no disagreement between us.
You’re writing lots of things here but as far as I can tell you aren’t defending your opening statement, which I believe is mistaken.
Firstly, it’s just not more reasonable. When you ask yourself “Is a machine learning run going to lead to human extinction?” you should not first say “How trustworthy are people who have historically claimed the world is ending?”; you should of course primarily bring your attention to questions about what sort of machine is being built, what sort of thinking capacities it has, what sorts of actions it can take in the world, what sorts of optimization it runs, how it would behave around humans if it were more powerful than them, and so on. We can go back to discussing epistemology 101 if need be (e.g. “Hug the Query!”).
Secondly, insofar as someone believes you are a huckster or a crackpot, you should leave the conversation, communication here has broken down and you should look for other communication opportunities. However, insofar as someone is only evaluating this tentatively as one of many possible hypotheses about you then you should open yourself up to auditing / questioning by them about why you believe what you believe and your past history and your memetic influences. Being frank is the only way through this! But you shouldn’t say to them “Actually, I think you should treat me like a huckster/scammer/serf-of-a-corrupt-empire.” This feels analogous to a man on a date with a woman saying “Actually I think you should strongly privilege the hypothesis that I am willing to rape you, and now I’ll try to provide evidence for you that this is not true.” It would be genuinely a bad sign about a man that he thinks that about himself, and also he has moved the situation into a much more adversarial frame.
I suspect you could write some more narrow quick-take such as “Here is some communication advice I find helpful when talking with friends and colleagues about how AI can lead to human extinction”, but in generalizing it all the way to making dictates about basic epistemology you are making basic mistakes and getting it wrong.
Please either (1) defend and/or clarify the original statement, or (2) concede that it was mistaken, rather than writing more semi-related paragraphs about memetic immune systems.
But you should absolutely ask “does it look like I’m making the same mistakes they did, and how would I notice if it were so?” Sometimes one is indeed in a cult with one’s methods of reason subverted, or having a psychotic break, or captured by a content filter that hides the counterevidence, or subject to one of the more mundane and pervasive failures of this kind.
I am confused why you think my claims are only semi related. to me my claim is very straightforward, and the things i’m saying are straightforwardly converying a world model that seems to me to explain why i believe my claim. i’m trying to explain in good faith, not trying to say random things. i’m claiming a theory of how people parse information, to justify my opening statement, which i can clarify as:
sometimes, people use the rhetorical move of saying something like “people think 95% doom is overconfident, yet 5% isn’t. but that’s also being 95% confident in not-doom, and yet they don’t consider that overconfident. curious.” followed by “well actually, it’s only a big claim under your reference class. under mine, i.e. the set of all instances of a more intelligent thing emerging, actually, 95% doom is less overconfident than 5% doom.” this post was inspired by seeing one such tweet, but i see claims like this every once in a while that play reference class tennis.
i think this kind of argument is really bad at persuading people who don’t already agree (from empirical observation). my opening statement is saying “please stop doing this, if you do it, and thank you for not doing this, if you don’t already do it.” the rest of my paragraphs provide an explanation of my theory for why this is bad for changing people’s minds. this seems pretty obviously relevant for justifying why we should stop doing the thing. i sometimes see people out there talk like this (including my past self at some point), and then fail to convince people, and then feel very confused about why people don’t see the error of their ways when presented with an alternative reference class. if my theory is correct (maybe it isn’t, this isn’t a super well thought out take, it’s more a shower thought), then it would explain this, and people who are failing to convince people would probably want to know why they’re failing. i did not spell this out in my opening statement because i thought it was clear, but in retrospect it was not.
i don’t think the root cause is people being irrational epistemically. i think there is a fundamental reason why people do this that is very reasonable. i think you disagree with this on the object level and many of my paragraphs are attempting to respond to what i view as the reason you disagree. this does not explicitly show up in the opening statement, but since you disagree with this, i thought it would make sense to respond to that too
i am not saying you should explicitly say “yeah i think you should treat me as a scammer until i prove otherwise”! i am also not saying you should try to argue with people who have already stopped listening to you because they think you’re a scammer! i am merely saying we should be aware that people might be entertaining that as a hypothesis, and if you try to argue by using this particular class of rhetorical move, you will only trigger their defenses further, and that you should instead just directly provide the evidence for why you should be taken seriously, in a socially appropriate manner. if i understand correctly, i think the thing you are saying one should do is the same as the thing i’m saying one should do, but phrased in a different way; i’m saying not to do a thing that you seem to already not be doing.
i think i have not communicated myself well in this conversation, and my mental model is that we aren’t really making progress, and therefore this conversation has not brought value and joy into the world in the way i intended. so this will probably be my last reply, unless you think doing so would be a grave error.
This seems wrong to me.
a. More smaller things happen and there are fewer kinds of smaller thing that happen.
b. I bet people genuinely have more evidence for small claims they state than big ones on average.
c. The skepticism you should have because particular claims are frequently adversarially generated shouldn’t first depend on deciding to be skeptical about it.
If you’ll forgive the lack of charity, ISTM that leogao is making IMO largely true points about the reference class and then doing the wrong thing with those points, and you’re reacting to the thing being done wrong at the end, but trying to do this in part by disagreeing with the points being made about the reference class. leogao is right that people are reasonable in being skeptical of this class of claims on priors, and right that when communicating with someone it’s often best to start within their framing. You are right that regardless it’s still correct to evaluate the sum of evidence for and against a proposition, and that other people failing to communicate honestly in this reference class doesn’t mean we ought to throw out or stop contributing to the good faith conversations available to us.
Thanks for the comment. (Upvoted.)
a. I expect there is a slightly more complicated relationship between my value-function and the likely configuration states of the universe than literally zero-correlation, but most configuration states do not support life and we are all dead, so in one sense a claim that in the future something very big and bad will happen is far more likely on priors. One might counter that we live in a highly optimized society where things being functional and maintained is an equilibrium state and it’s unlikely for systems to get out of whack enough for bad things to happen. But taking this straightforwardly is extremely naive, tons of bad things happen all the time to people. I’m not sure whether to focus on ‘big’ or ‘bad’ but either way, the human sense of these is not what the physical universe is made out of or cares about, and so this looks like an unproductive heuristic to me.
b. On the other hand, I suspect the bigger claims are more worth investing time to find out if they’re true! All of this seems too coarse-grained to produce a strong baseline belief about big claims or small claims.
c. I don’t get this one. I’m pretty sure I said that if you believe that you’re in a highly adversarial epistemic environment, then you should become more distrusting of evidence about memetically fit claims.
I don’t know what true points you think Leo is making about “the reference class”, nor which points you think I’m inaccurately pushing back on that are true about “the reference class” but not true of me. Going with the standard rationalist advice, I encourage everyone to taboo “reference class” and replace it with a specific heuristic. It seems to me that “reference class” is pretending that these groupings are more well-defined than they are.
Well, sure, it’s just you seemed to frame this as a binary on/off thing, sometimes you’re exposed and need to count it and sometimes you’re not, whereas to me it’s basically never implausible that a belief has been exposed to selection pressures, and the question is of probabilities and degrees.
i’m not even saying people should not evaluate evidence for and against a proposition in general! it’s just that this is expensive, and so it is perfectly reasonable to have heuristics to decide which things to evaluate, and so you should first prove with costly signals that you are not pwning them, and then they can weigh the evidence. until you can provide enough evidence that you’re not pwning them for it to be worth their time to evaluate your claims in detail, it should not be surprising that many people won’t listen to the evidence; and even if they do listen, if there is still lingering suspicion that they are being pwned, you need to provide the type of evidence that could persuade someone that they aren’t getting pwned (for which being credibly very honest and truth seeking is necessary but not sufficient), which is sometimes different from mere compellingness of argument.
I think the framing that sits better to me is ‘You should meet people where they’re at.’ If they seem like they need confidence that you’re arguing from a place of reason, that’s probably indeed the place to start.
I think you’re correct. There’s a synergistic feedback loop between alarmism and social interaction that filters out pragmatic perspectives, creating the illusion that the doom surrounding any given topic is more prevalent than it really is, or even that it’s near universal.
Even before the rise of digital information, the feedback phenomenon could be observed in any insular group. In today’s environment, where a lot of effort goes into exploiting that feedback loop, it takes a conscious effort to maintain perspective, or even to remain aware that there are other perspectives.
# AI and the Future of Personalized Education: A Paradigm Shift in Learning
Recently, I’ve been exploring the theory of computation. With the rapid advancement of artificial intelligence—essentially a vast collection of algorithms and computational instructions designed to process inputs and generate outputs—I find myself increasingly curious about the fundamental capabilities and limitations of computation itself. Concepts such as automata, Turing machines, computability, and complexity frequently appear in discussions about AI, yet my understanding of these topics is still developing. I recently encountered fascinating articles by Stephen Wolfram, including [Observer Theory](https://writings.stephenwolfram.com/2023/12/observer-theory/) and [A New Kind of Science: A 15-Year View](https://writings.stephenwolfram.com/2017/05/a-new-kind-of-science-a-15-year-view/). Wolfram presents intriguing ideas, such as the claim that beyond a certain minimal threshold, nearly all processes—natural or artificial—are computationally equivalent in sophistication, and that even the simplest rules (like cellular automaton Rule 30) can produce irreducible, unpredictable complexity.
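As a concrete illustration of that last point, here is a minimal Rule 30 sketch in plain Python (an illustrative aside of mine, not taken from Wolfram’s articles): the update rule is a single boolean expression, yet the pattern grown from one black cell quickly looks irregular and hard to predict.

```python
# Minimal sketch of cellular automaton Rule 30: each cell's next state is
# left XOR (center OR right). A single black cell produces a complex pattern.
def rule30_step(row):
    padded = [0] + row + [0]  # pad the edges with white cells
    return [padded[i - 1] ^ (padded[i] | padded[i + 1]) for i in range(1, len(padded) - 1)]

row = [0] * 31 + [1] + [0] * 31  # one black cell in the middle of a 63-cell row
for _ in range(30):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)
```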
Before the advent of AI tools, my approach to learning involved selecting a relevant book, reading through it, and working diligently on exercises. A significant challenge in self-directed learning is the absence of immediate guidance when encountering difficulties. To overcome this, I would synthesize information from various sources—books, online resources, and Q&A platforms like Stack Overflow—to clarify my doubts. Although rewarding, as it encourages the brain to form connections and build new knowledge, this process is undeniably time-consuming. Imagine if we could directly converse with the author of a textbook—transforming the author into our personal teacher would greatly enhance learning efficiency.
In my view, an effective teacher should possess the following qualities:
- Expertise in the subject matter, with a depth of knowledge significantly greater than that of the student, and familiarity with related disciplines to provide a comprehensive understanding.
- A Socratic teaching style, where the teacher guides students through questions, encourages active participation, corrects misconceptions, and provides constructive feedback. The emphasis should be on the learning process rather than merely arriving at the correct answer.
- An ability to recognize and address the student’s specific misunderstandings, adapting teaching methods to suit the student’s individual learning style and level.
Realistically, not all teachers I’ve encountered meet these criteria. Good teachers are scarce resources, which explains why parents invest heavily in quality education and why developed countries typically have more qualified teachers than developing ones.
With the emergence of AI tools, I sense a potential paradigm shift in education. Rather than simply asking AI to solve problems, we can leverage AI as a personalized teacher. For undergraduate-level topics, AI already surpasses the average classroom instructor in terms of breadth and depth of knowledge. AI systems effectively function as encyclopedias, capable of addressing questions beyond the scope of typical educators. Moreover, AI can be easily adapted to employ a Socratic teaching approach. However, current AI still lacks the nuanced ability to fully understand a student’s individual learning style and level. It relies heavily on the learner’s self-awareness and reflection to identify gaps in understanding and logic, prompting the learner to seek clarification. This limitation likely arises because large language models (LLMs) are primarily trained to respond to human prompts rather than proactively prompting humans to think critically.
Considering how AI might reshape education, I offer the following informal predictions:
- AI systems will increasingly be trained specifically as teachers, designed to prompt learners through Socratic questioning rather than simply providing direct answers. A significant challenge will be creating suitable training environments and sourcing data that accurately reflect the learning process. Potential training resources could include textbooks, Q&A platforms like Stack Overflow and Quora, and educational videos from Khan Academy and MIT OpenCourseWare.
- AI-generated educational content will become dynamic and personalized, moving beyond traditional chatbot interactions. Similar to human teachers, AI might illustrate concepts through whiteboard explanations, diagrams, or even programming demonstrations. Outputs could include text, images, videos, or interactive web-based experiences.
- The number of AI teachers will vastly exceed the number of human teachers, significantly reducing the cost of education. This transformation may occur before 2028, aligning with predictions outlined in [AI-2027](https://ai-2027.com/).
In a hypothetical future where AI can perform every cognitive task, will humans still need to learn? Will we still require teachers? If AI remains friendly and supportive, I believe human curiosity will persist, though the necessity for traditional learning may diminish significantly. Humans might even use AI to better understand AI itself. Conversely, if AI were to become adversarial, perhaps humans would still have roles to fulfill, necessitating AI to teach humans the skills required for these tasks.
This is known in computer science as Turing completeness.
As a former teacher, I 100% agree.
I am not sure I understand this part. How specifically does being a developed country increase the number of teachers able to do the Socratic method etc.? (I could make a guess, but I am interested in your interpretation.)
In 2021, Daniel Ellsberg leaked US government plans from 1958 to make a nuclear first strike on China over that year’s Taiwan conflict.
Daniel Ellsberg copied these papers more than 50 years ago but only released them now because he thought another conflict over Taiwan may be possible soon.
Unredacted report here
Thought it might be interesting to share.
As usual, seems clear Dulles was more interested in escalating the conflict (in this case, to nuclear) than Eisenhower.
One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding “flips” from representing the original token to finally representing the prediction of the next token.
By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:
- some 1000 dimensions encode the original token
- some other 1000 dimensions encode the prediction of the next token
- the remaining 10,288 dimensions encode information about all available context (which will start out “empty” and get filled with meaningful information through the layers).
In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there’s the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it’s still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.
Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just “translate” from token space to embedding space and vice versa[1]. This made sense in relation to the initial & naive “embeddings represent tokens” interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an “extraction” of the information content in the embedding that encodes the prediction.
One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that “Zero layer transformers model bigram statistics”. So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I’m not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
I would guess that transformer-experienced people (unless they disagree with my description—in that case, please elaborate what I’m still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.
Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word “Good” and then unembed the embedding immediately, you would get a very high probability for “Good” back when in practice (I didn’t verify this yet) you would probably obtain high probabilities for “morning”, “day” etc.
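Here is a minimal sketch of that check (my addition, assuming GPT-2 via the Hugging Face transformers library; note GPT-2 ties its embedding and unembedding matrices, so other models may behave differently): embed one token, then unembed it immediately, skipping every layer, and look at the top logits.

```python
# Sketch: embed a single token, then unembed it right away (skipping all layers)
# and inspect the top logits. Assumes GPT-2 from Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

token_id = tok.encode(" Good")[0]
with torch.no_grad():
    emb = model.transformer.wte.weight[token_id]          # token embedding
    logits = model.lm_head(model.transformer.ln_f(emb))   # unembed immediately
top_ids = torch.topk(logits, k=5).indices.tolist()
print([tok.decode([i]) for i in top_ids])
```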
You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it) like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases
You could also plot the cos-sims of the resulting biases to see how much it rotates.
There has actually been some work visualizing this process, with a method called the “logit lens”.
The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
A more thorough analysis: https://arxiv.org/abs/2303.08112
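For a concrete picture of what the logit lens does, here is a minimal sketch (assuming GPT-2 via Hugging Face transformers): apply the final layer norm and the unembedding to the residual stream after every layer and look at the top next-token guess at the last position.

```python
# Minimal logit-lens sketch: unembed the residual stream after every layer
# and print the top next-token guess at the final position. Assumes GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):  # hidden_states[0] is the embedding output
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, tok.decode([logits.argmax().item()]))
```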
Awkwardly, it depends on whether the model uses tied embeddings (unembed is embed transpose) or has separate embed and unembed matrices. Using tied embedding matrices like this means the model actually does have to do a sort of conversion.
Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don’t think the initial state is like “1k encode current, 1k encode predictions, rest start empty”. The model can just directly encode predictions for an initial state using the unembed.
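A quick way to check which of the two cases a given model falls into (a small sketch assuming the Hugging Face transformers API):

```python
# Sketch: check whether a model's embedding and unembedding weights are tied.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tied = (model.get_input_embeddings().weight.data_ptr()
        == model.get_output_embeddings().weight.data_ptr())
print(tied)  # True for GPT-2 (shared tensor); many other models keep them separate
```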
Do it! I bet slightly against your prediction.
Eat your caffeine
…instead of drinking it. I recommend these.
They have the same dosage as a cup of coffee (~100mg).
You can still drink coffee/Diet Coke/tea, just get it without caffeine. Coke caffeine-free, decaf coffee, herbal tea.
They cost ~60¢ per pill [EDIT: oops, it’s 6¢ per pill — thanks @ryan_greenblatt] vs ~$5 for a cup of coffee — that’s about an order of magnitude cheaper.
You can put them in your backpack or back pocket or car. They don’t go bad, they’re portable, they won’t spill on your clothes, they won’t get cold.
Straight caffeine makes me anxious. L-Theanine makes me less anxious. The caffeine capsules I linked above have equal parts caffeine and L-Theanine.
Also:
Caffeine is a highly addictive drug; you should treat it like one. Sipping a nice hot beverage doesn’t make me feel like I’m taking a stimulant in the way that swallowing a pill does.
I don’t know how many milligrams of caffeine were in the last coffee I drank. But I do know exactly the amount of caffeine in every caffeine pill I’ve ever taken. Taking caffeine pills prevents accidentally consuming way too much (or too little) caffeine.
I don’t want to associate “caffeine” with “tasty sugary sweet drink,” for two reasons:
A lot of caffeinated beverages contain other bad stuff. You might not by-default drink a sugary soft drink if it weren’t for the caffeine, so disambiguating the associations in your head might cause you to eat your caffeine and not drink the soda.
Operant conditioning works by giving positive reinforcement to certain behaviors, causing them to happen more frequently. Like, for instance, giving someone a sugary soft drink every time they take caffeine. But when I take caffeine, I want to be taking it because of a reasoned decision-making process minimally swayed by factors not under my control. So I avoid giving my brain a strong positive association with something that happens every time it experiences caffeine (e.g. a sugary soft drink). Caffeine is addictive enough! Why should I make the Skinner box stronger?
If you can’t take pills, consider getting caffeine patches — though I’ve never tried them, so can’t give it my personal recommendation.
Disclaimers:
Generic caveats.
Caffeine is a drug. I’m not a doctor, take caffeine at your own risk, this is not medical advice.
This post does not take a stance on whether or not you should take caffeine; the stance that it takes is, conditional on your already having decided to take caffeine, you should take it in pill form (instead of in drink form).
Are you buying your coffee from a cafe every day or something? You can buy a pack of nice grounds for like $13, and that lasts more than a month (126 Tbsp/pack / (3 Tbsp/day) = 42 days/pack), totaling 30¢/day. Half the cost of a caffeine pill. And that’s if you don’t buy bulk.
i’m not (i don’t buy caffeinated drinks!), but the people i’m responding to in this post are. in particular, i often notice people go from “i need caffeine” → “i’ll buy a {coffee, tea, energy drink, etc}” — for example, college students, most of whom don’t have the wherewithal to go to the effort of making their own coffee.
It’s actually $0.06 / pill, not $0.60. Doesn’t make a big difference to your bottom line though as both costs are cheap.
thanks, edited!
One question I’m curious about: do these pills have fewer effects (or none) on your bowels compared to a cup of coffee? Is it something about the caffeine itself, something else in the coffee, or the mode of absorption? If they avoid those effects then I’m genuinely interested.
I second that.
If you take caffeine regularly, I also recommend experimenting with tolerance build-up, which the pill form makes easy. You want to figure out the minimal number of days N such that if you don’t take caffeine every N days, you don’t develop tolerance. For me, N turned out to be equal to 2: if I take 100 mg of caffeine every second day, it always seems to have its full effect (or tolerance develops very slowly; and you can “reset” any such slow creep-up by quitting caffeine for e. g. 1 week every 3 months).
You can test that by taking 200 mg at once[1] after 1-2 weeks of following a given intake schedule. If you end up having a strong reaction (jitteriness, etc., it’s pretty obvious, at least in my experience), you haven’t developed tolerance. If the reaction is only about as strong as taking 100 mg on a toleranceless stomach[2], then you have.
(Obviously the real effects are probably not so neatly linear, and it might work for you differently. But I think the overarching idea of testing caffeine tolerance build-up by monitoring whether the rather obvious “too much caffeine” point moved up or not, is an approach with a much better signal/noise ratio than doing so via e. g. confounded cognitive tests.)
Once you’ve established that, you can try more complicated schemes. E. g., taking 100 mg on even days and 200 mg on odd days. Some caffeine effects are plausibly not destroyed by tolerance, so this schedule lets you reap those every day, and have full caffeine effects every second day. (Again, you can test for nonlinear tolerance build-up effects by following this schedule for 1-2 weeks, then taking a larger dose of 300-400 mg[3], and seeing where its effect lies on the “100 mg on a toleranceless stomach” to “way too much caffeine” spectrum.)
Assuming it’s safe for you, obviously.
You can establish that baseline by stopping caffeine intake for 3 weeks, then taking a single 100 mg dose. You probably want to do that anyway for the N-day experimentation.
Note that this is even more obviously dangerous if you have any health problems/caffeine contraindications, so this might not work for you.
Counterargument: sure, good decaf coffee exists, but it’s harder to get hold of. Because it’s less popular, the decaf beans at cafés are often less fresh or from a worse supplier. Some places don’t stock decaf coffee. So if you like the taste of good coffee, taking caffeine pills may limit the amount of good coffee you can access and drink without exceeding your desired dose.
And good decaf black tea is even harder to get…
As a black tea enjoyer I would argue it’s practically non-existent; no decaf black tea I’ve ever tried even comes close to the best “normal” black tea sorts.
This is true of all teas. The decaf ones all are terrible. I spent a while trying them in the hopes of cutting down my caffeine consumption, but the taste compromise is severe. And I’d say that the black decaf teas were the best I tried, mostly because they tend to have much more flavor & flavorings, so there was more left over from the water or CO2 decaffeination...
Also consider modafinil
there are plenty of other common stimulants, but caffeine is by far the most commonly used — and also the most likely to be taken mixed into a tasty drink, rather than in a pill.
Beware mistaking a “because” for an “and”. Sometimes you think something is X and Y, but it turns out to be X because Y.
For instance, I was recently at a metal concert, and helped someone off the ground in a mosh pit. Someone thanked me afterwards but to me it seemed like the most obvious thing in the world.
A mosh pit is not fun AND a place where everyone helps each other. It is fun BECAUSE everyone helps each other. Play-acting aggression while being supportive is where the fun is born.
If you don’t believe in your work, consider looking for other options
I spent 15 months working for ARC Theory. I recently wrote up why I don’t believe in their research. If one reads my posts, I think it should become very clear to the reader that either ARC’s research direction is fundamentally unsound, or I’m still misunderstanding some of the very basics after more than a year of trying to grasp it. In either case, I think it’s pretty clear that it was not productive for me to work there. Throughout writing my posts, I felt an intense shame imagining readers asking the very fair question: “If you think the agenda is so doomed, why did you keep working on it?”[1]
In my first post, I write: “Unfortunately, by the time I left ARC, I became very skeptical of the viability of their agenda.” This is not quite true. I was very skeptical from the beginning, for largely similar reasons I expressed in my posts. But first I told myself that I should stay a little longer. Either they manage to convince me that the agenda is sound, or I demonstrate that it doesn’t work, in which case I free up the labor of the group of smart people working on the agenda. I think this was initially a somewhat reasonable position, though it was already in large part motivated reasoning.
But half a year after joining, I don’t think this theory of change was very tenable anymore. It was becoming clear that our arguments were going in circles. I couldn’t convince Paul and Mark (the two people thinking the most about the big picture questions), nor could they convince me. Eight months in, two friends visited me in California, and they noticed that I always derailed the conversation when they asked me about my research. That should have been an important sign: I was ashamed to talk about my research with my friends, because I was afraid they would see how crazy it was. I should have quit then, but I stayed for another seven months.
I think this was largely due to cowardice. I’m very bad at coding and all my previous attempts at upskilling in coding went badly.[2] I thought of my main skill as being a mathematician, and I wanted to keep working on AI safety. The few other places one can work as a mathematician in AI safety looked even less promising to me than ARC. I was afraid that if I quit, I wouldn’t find anything else to do.
In retrospect, this fear was unfounded. I realized there were other skills one can develop, not just coding. In my afternoons, I started reading a lot more papers and serious blog posts [3] from various branches of AI safety. After a few months, I felt I had much more context on many topics. I started to think more about what I can do with my non-mathematical skills. When I finally started applying for jobs, I got an offer from the European AI Office and UKAISI, and it looked more likely than not that I would get an offer from Redwood. [4]
Other options I considered that looked less promising than the three above, but still better than staying at ARC:
- Team up with some Hungarian coder friends and execute some simple but interesting experiments I had vague plans for. [5]
- Assemble a good curriculum for the prosaic AI safety agendas that I like.
- Apply for a grant-maker job.
- Become a Joe Carlsmith-style general investigator.
- Try to become a journalist or an influential blogger.
- Work on crazy acausal trade stuff.
I still think many of these were good opportunities, and probably there are many others. Of course, different options are good for people with different skill profiles, but I really believe that the world is ripe with opportunities to be useful for people who are generally smart and reasonable and have enough context on AI safety. If you are working on AI safety but don’t really believe that your day-to-day job is going anywhere, remember that having context and being ingrained in the AI safety field is a great asset in itself,[6] and consider looking for other projects to work on.
(Important note: ARC was a very good workplace, my coworkers were very nice to me and receptive to my doubts, and I really enjoyed working there except for feeling guilty that my work is not useful. I’m also not accusing the people who continue working at ARC of being cowards in the way I have been. They just have a different assessment of ARC’s chances, or work on lower-level questions than I have, where it can be reasonable to just defer to others on the higher-level questions.)
(As an employee of the European AI Office, it’s important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
No, really, it felt very bad writing the posts. It felt like describing how I worked for a year on a scheme that was either trying to build perpetual motion machines, or trying to build normal cars while I somehow missed the fact that gasoline exists. Embarrassing either way.
I don’t know why. People keep telling me that it should be easy to upskill, but for some reason it is not.
I particularly recommend Redwood’s blog.
We didn’t fully finish the work trial as I decided that the EU job was better.
Think of things in the style of some of Owain Evans’ papers or experiments on faithful chain of thought.
And having more context and knowledge is relatively easy to further improve by reading for a few months. It’s a young field.
How exactly are you measuring coding ability? What are the ways you’ve tried to upskill, and what are common failure modes? Can you describe your workflow at a high-level, or share a recording? Are you referring to competence at real world engineering tasks, or performance on screening tests?
There’s a chrome extension which lets you download leetcode questions as jupyter notebooks: https://github.com/k-erdem/offlineleet. After working on a problem, you can make a markdown cell with notes and convert it into flashcards for regular review: https://github.com/callummcdougall/jupyter-to-anki.
I would suggest scheduling calls with friends for practice sessions so that they can give you personalized feedback about what you need to work on.
I disagree. Instead, I think that either ARC’s research direction is fundamentally unsound, or you’re still misunderstanding some of the finer details after more than a year of trying to grasp it. Like, your post is a few layers deep in the argument tree, and the discussions we had about these details (e.g. in January) went even deeper. I don’t really have a position on whether your objections ultimately point at an insurmountable obstacle for ARC’s agenda, but if they do, I think one needs to really dig into the details in order to see that.
(ETA: I agree with your post overall, though!)
That’s not how I see it. I think the argument tree doesn’t go very deep until I lose the thread. Here are a few, slightly stylized but real, conversations I had with friends who had no context on what ARC was doing, when I tried to explain our research to them:
Me: We want to do Low Probability Estimation.
Them: Does this mean you want to estimate the probability that ChatGPT says a specific word after a 100 words on chain of thought? Isn’t this clearly impossible?
Me: No, you see, we only want to estimate the probabilities as well as the model knows them.
Them: What does this mean?
Me: [I can’t answer this question.]
Me: We want to do Mechanistic Anomaly Detection.
Them: Isn’t this clearly impossible? Won’t this result in a lot of false positives when anything out of distribution happens?
Me: Yes, that’s why we have this new clever idea of relying on the fragility of sensor tampering: that if you delete a subset of the actions, you will get an inconsistent image.
Them: What if the AI builds another robot to tamper with the cameras?
Me: We actually don’t want to delete actions, but rather to use heuristic arguments for why the cameras will show something, and we want to construct heuristic explanations in a way that they carry over through delegated actions.
Them: What does this mean?
Me: [I can’t answer this question.]
Me: We want to create Heuristic Arguments to explain everything the model does.
Them: What does it mean that an argument explained a behavior? What is even the type signature of heuristic arguments? And you want to explain everything a model does? Isn’t this clearly impossible?
Me: [I can’t answer this question.]
When I was explaining our research to outsiders (which I usually tried to avoid out of cowardice), we usually got to some of these points within minutes. So I wouldn’t say these are fine details of our agenda.
During my time at ARC, the majority of my time was spent asking Mark and Paul variations of these three questions. They always kindly answered, and the answer was convincing-sounding enough for the moment that I usually couldn’t really reply on the spot, and then I went back to my room to think through their answers. But I never actually understood their answers, and I can’t reproduce them now. Really, I think that was the majority of the work I did at ARC. When I left, you guys should have bought a rock with “Isn’t this clearly impossible?” written on it, and it would have profitably replaced my presence.
That’s why I’m saying that either ARC’s agenda is fundamentally unsound or I’m still missing some of the basics. All that stands between ARC’s agenda and collapse under five minutes of questioning from an outsider is that Paul and Mark (and maybe others on the team) have some convincing-sounding answers to the three questions above. So I would say that these answers are really part of the basics, and I never understood them.
Maybe Mark will show up in the comments now to give answers to the three questions, and I expect the answers to sound kind of convincing, and I won’t have a very convincing counter-argument other than some rambling reply saying essentially that “I think this argument is missing the point and doesn’t actually answer the question, but I can’t really point out why, because I don’t actually understand the argument because I don’t understand how you imagine heuristic arguments”. (This is what happened in the comments on my other post, and thanks to Mark for the reply and I’m sorry for still not understanding it.) I can’t distinguish whether I’m just bad at understanding some sound arguments here, or the arguments are elaborate self-delusions of people who are smarter and better at arguments than me. In any case, I feel epistemic learned helplessness on some of these most basic questions in ARC’s agenda.
What is your opinion on the Low Probability Estimation paper published this year at ICLR?
I don’t have a background in the field, but it seems like they were able to get results that indicate the approach can extract something. https://arxiv.org/pdf/2410.13211
It’s a nice paper, and I’m glad they did the research, but importantly, the paper reports a negative result about our agenda. The main result is that the method inspired by our ideas under-performs the baseline. Of course, these are just the first experiments, work is ongoing, this is not conclusive negative evidence for anything. But the paper certainly shouldn’t be counted as positive evidence for ARC’s ideas.
Do you think that it would be worth it to try to partially sort this out in a LW dialogue?
IME, in the majority of cases, when I strongly felt like quitting but was also inclined to justify “staying just a little bit longer because XYZ”, and listened to my justifications, staying turned out to be the wrong decision.
Relevant classic paper from Steven Levitt. Abstract [emphasis mine]:
Pretty much the whole causal estimate comes down to the influence of happiness 6 months after quitting a job or breaking up. Almost everything else is swamped with noise. The only individual question with a consistent causal effect larger than the standard error was “should I break my bad habit?”, and doing so made people unhappier. Even for those factors, there’s a lot of biases in this self-report data, which the authors noted and tried to address. I’m just not sure what we can really learn from this, even though it is a fun study.
If you want to upskill in coding, I’m open to tutoring you for money.
I keep seeing the first clause as “I don’t believe in your work”.
Here’s some near-future fiction:
In 2027 the trend that began in 2024 with OpenAI’s o1 reasoning system has continued. The cost of running AI is no longer negligible compared to the cost of training it. Models reason over long periods of time. Their effective context windows are massive, they update their underlying models continuously, and they break tasks down into sub-tasks to be carried out in parallel. The base LLM they are built on is two generations ahead of GPT-4.
These systems are language model agents. They are built with self-understanding and can be configured for autonomy. These constitute proto-AGI. They are artificial intelligences that can perform much but not all of the intellectual work that humans can do (although even the work these AIs can do, they cannot necessarily do more cheaply than a human could).
In 2029 people have spent over a year working hard to improve the scaffolding around proto-AGI to make it as useful as possible. Then the next generation of LLM foundation model is released. Now, with some further improvements to the reasoning and learning scaffolding, this is true AGI. It can perform any intellectual task that a human could (although it’s very expensive to run at full capacity). It is better at AI research than any human. But it is not superintelligence. It is still controllable and its thoughts are still legible. So, it is put to work on AI safety research. Of course, by this point much progress has already been made on AI safety, but it seems prudent to get the AGI to look into the problem and get its go-ahead before commencing the next training run. After a few months the AI declares it has found an acceptable safety approach. It spends some time on capabilities research, then the training run for the next LLM begins.
In 2030 the next LLM is completed, and improved scaffolding is constructed. Now human-level AI is cheap, better-than-human-AI is not too expensive, and the peak capabilities of the AI are almost alien. For a brief period of time the value of human labour skyrockets, workers acting as puppets as the AI instructs them over video-call to do its bidding. This is necessary due to a major robotics shortfall. Human puppet-workers work in mines, refineries, smelters, and factories, as well as in logistics, optics, and general infrastructure. Human bottlenecks need to be addressed. This takes a few months, but the ensuing robotics explosion is rapid and massive.
2031 is the year of the robotics explosion. The robots are physically optimised for their specific tasks, coordinate perfectly with other robots, are able to sustain peak performance, do not require pay, and are controlled by cleverer-than-human minds. These are all multiplicative factors for the robots’ productivity relative to human workers. Most robots are not humanoid, but let’s say a humanoid robot would cost $x. Per $x, robots in 2031 are 10,000 times more productive than a human. This might sound like a ridiculously high number: one robot the equivalent of 10,000 humans? But let’s do some rough math:
| Advantage | Productivity Multiplier (relative to skilled human) |
|---|---|
| Physically optimised for their specific tasks | 5 |
| Coordinate perfectly with other robots | 10 |
| Able to sustain peak performance | 5 |
| Do not require pay | 2 |
| Controlled by cleverer-than-human minds | 20 |
5*10*5*2*20 = 10,000
Suppose that a human can construct one robot per year (taking into account mining and all the intermediate logistics and manufacturing). With robots 10^4 times as productive as humans, each robot will construct an average of 10^4 robots per year. This is the robotics explosion. By the end of the year there will be 10^11 robots (more precisely, a number of robots that is cost-equivalent to 10^11 humanoid robots).
By 2032 there are 10^11 robots, each with the productivity of 10^4 skilled human workers. That is a total productivity equivalent to 10^15 skilled human workers. This is roughly 10^5 times the productivity of humanity in 2024. At this point trillions of advanced processing units have been constructed and are online. Industry expands through the Solar System. The number of robots continues to balloon. The rate of research and development accelerates rapidly. Human mind upload is achieved.
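Since the scenario leans on this back-of-the-envelope arithmetic, here is a minimal sketch (in Python) that just re-runs the numbers. Only the multipliers and the 10^11 / 10^15 figures come from the text above; the 2024 workforce figure and the implied starting robot stock are my own assumptions for illustration.

```python
# Rough arithmetic check of the robotics-explosion numbers above.
# Values marked "assumption" are mine, not from the scenario.

multipliers = {
    "physically optimised for specific tasks": 5,
    "coordinate perfectly with other robots": 10,
    "sustain peak performance": 5,
    "no pay required": 2,
    "controlled by cleverer-than-human minds": 20,
}

# Productivity of one robot, in skilled-human equivalents.
per_robot_productivity = 1
for m in multipliers.values():
    per_robot_productivity *= m
print(per_robot_productivity)                         # 10000, i.e. 5*10*5*2*20

# 2032 totals quoted in the scenario.
robots_2032 = 1e11
worker_equivalents = robots_2032 * per_robot_productivity
print(worker_equivalents)                             # 1e+15 skilled-worker equivalents

human_workers_2024 = 4e9                              # assumption: rough global workforce
print(worker_equivalents / human_workers_2024)        # 250000.0, i.e. roughly 10^5 times

# Back of the envelope for the 10^11 robot count: a human builds 1 robot/year,
# so a robot builds ~10^4 robots/year. Ignoring intra-year compounding (an
# assumption; the scenario doesn't say), ending the year with 1e11 robots
# implies a starting stock on the order of 1e7 robot-equivalents.
robots_built_per_robot_per_year = per_robot_productivity * 1
print(robots_2032 / robots_built_per_robot_per_year)  # 10000000.0, i.e. ~1e7
```

The point is only that the quoted figures are internally consistent given the stated multipliers, not that the multipliers themselves are right.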
It’s been 7 months since I wrote the comment above. Here’s an updated version.
It’s 2025 and we’re currently seeing the length of tasks AI can complete double every 4 months [0]. This won’t last forever [1]. But it will last long enough: well into 2026. There are twenty months from now until the end of 2026, so according to this pattern we can expect to see 5 doublings from the current time horizon of 1.5 hours, which would get us to a time horizon of 48 hours.
But we should actually expect even faster progress, for two reasons:
(1) AI researcher productivity will be amplified by increasingly-capable AI [2]
(2) the difficulty of each subsequent doubling is less [3]
This second point is plain to see when we look at extreme cases:
Going from 1 minute to 10 minutes necessitates vast amounts of additional knowledge and skill; going from 1 year to 10 years, very little of either. The amount of progress required to go from 1.5 to 3 hours is much greater than from 24 to 48 hours, so we should expect doublings to take less than 4 months in 2026. Instead of reaching just 48 hours, then, we may reach, say, 200 hours (the arithmetic is sketched below).
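For concreteness, here is a minimal sketch (in Python) of this arithmetic: the constant 4-month-doubling baseline, and the average doubling time that would be needed to reach roughly 200 hours by the end of 2026. The 200-hour target is just the figure floated above, not something derived from the METR data.

```python
import math

# Time-horizon arithmetic for the ~20 months from mid-2025 to the end of 2026.
start_hours = 1.5     # current time horizon quoted above
months_left = 20

# Baseline: a constant 4-month doubling time gives 5 doublings.
doublings = months_left / 4
print(start_hours * 2 ** doublings)      # 48.0 hours

# If doublings speed up, how fast would they need to come, on average,
# to land at ~200 hours instead?
target_hours = 200
doublings_needed = math.log2(target_hours / start_hours)
print(doublings_needed)                  # ~7.1 doublings
print(months_left / doublings_needed)    # ~2.8 months per doubling, on average
```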
200-hour time horizons entail agency: error-correction, creative problem-solving, incremental improvement, scientific insight, and deeper self-knowledge will all be necessary to carry out these kinds of tasks.
So, by the end of 2026 we will have advanced AGI [4]. Knowledge work in general will be automated as human workers fail to compete on cost, knowledge, reasoning ability, and personability. The only knowledge workers remaining will be at the absolute frontiers of human knowledge. These knowledge workers, such as researchers at frontier AI labs, will have their productivity massively amplified by AI which can do the equivalent of hundreds of hours of skilled human programming, mathematics, etc. work in a fraction of that time.
The economy will not yet be anywhere near fully robotised (making enough robots takes time, as does the necessary algorithmic progress), so AI-directed human manual labour will be in extremely high demand.
But the writing will be on the wall for all to see: full automation, extending into space industry and hyperhuman science, will be correctly seen as an inevitability, and AI company valuations will have increased by totally unprecedented amounts. Leading AI company market capitalisations could realistically measure in the quadrillions, and the S&P 500 in the millions [5].
In 2027 a robotics explosion ensues. Vast amounts of compute come online, and space industry gets started (humanity returns to the Moon). AI surpasses the best human AI researchers, and by the end of the year, AI models trained by superhuman AI come online, decoupled from risible human data corpora, capable of conceiving things humans are simply biologically incapable of understanding. As industry fully robotises, humans obsolesce as workers and spend their time instead in leisure and VR entertainment. Healthcare progresses in leaps and bounds and crime is under control; relatively few people die.
In 2028 mind-upload tech is developed, death is a thing of the past, psychology and science are solved. AI space industry swallows the solar system and speeds rapidly out toward its neighbours, as ASI initiates its plan to convert the nearby universe into computronium.
Notes:
[0] https://theaidigest.org/time-horizons
[1] https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale
[2] such as OpenAI’s recently announced Codex
[3] https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks?commentId=xQ7cW4WaiArDhchNA
...
[4] Here’s what I mean by “advanced AGI”:
https://web.archive.org/web/20181231195954/https://foresight.org/Conferences/MNT05/Papers/Gubrud/index.php
[5] Associated prediction market:
https://manifold.markets/jim/will-the-sp-500-reach-1000000-by-eo?r=amlt
This sounds highly plausible. There are some other dangers your scenario leaves out, which I tried to explore in If we solve alignment, do we die anyway?