My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.
David James
I came here (already knowing the policy) looking for the specific tags to use by searching for angle brackets in the original post. But I could not easily find them there. I found some tags only in comments, but I’m not sure if they are authoritative. My request: please make the details of the exact markup tags stated plainly so it is obvious and easily discoverable. I use the LessWrong Markdown editor, not the WYSIWYG editor. Thanks!
Ah, “view source” got me to the following… Is the following canonical? If so, how do I make it look good in the LessWrong Markdown editor?
```html
<div class="llm-content-block" data-model-name="Claude Opus 4.6">
  <div class="llm-content-block-content">
    LLM content here
  </div>
</div>
```
Lobachevsky independently reached the same conclusion around the same time.
A quick and maybe pedantic (sorry!) readability comment: “Lobachevsky” appears here without introduction and without any further mention. My guess: many readers here won’t recognize the name. For me, it felt like “wait, did I miss something?” rather than “ah yes, the parallel discovery.” Suggestion: how about “the Russian mathematician Nikolai Lobachevsky” instead?
Recognizing the importance of choosing and comparing models / concepts might be a prerequisite concept. People learn this in various ways … When it comes to choosing what parameters to include in a model, statisticians compare models in various ways. They care a lot about predictive power when the goal is prediction, but also pay attention to multicollinearity when the goal is statistical inference. I see connections between a model’s parameters and an argument’s concepts. First, both have costs and benefits. Second, any particular combination has interactive effects that matter. Third, as a matter of epistemic discipline, it is important to recognize the value of trying and comparing frames of reference: different models for the statistician and different concepts for an argument.
What people concerned about AI-coding …
… said ~2 years ago:
AI coders will find sneaky ways to trick humans
… say today:
Humans won’t even be paying attention
This is obviously and intentionally exaggerated to make a point. Still, I don’t want this take to simply end there. I want to use the snark not as a conversation ender but as a starter. So… what are some better ways forward? Generally speaking, I don’t think “better awareness” or “more self-discipline” are winning strategies.
I’m relatively less interested in a competitive framing between OpenAI and Anthropic to see, e.g., “who played it better”. First, that framing suggests there was just one game being played. It seems more accurate to view it as a progression of different games.
To a first approximation, my guess is by the time this popped into the public spotlight, the die was largely cast (so to speak). It was, more or less, a strategy by Hegseth to put Anthropic in an impossible bind.
Second, that kind of framing feels too much like so many news stories I read that try to fasten sports metaphors onto real world events to make juicy narratives. This isn’t a very good “reason” I admit, but it sort of explains why my alarm bells started ringing on that frame.
Personally, I first want to learn about what happened and when. After that, maybe I would try to analyze and learn lessons.
I take your points individually, but I don’t synthesize them in the way I think you might.
To start, the top 0.01% wealthiest people are far from a representative sample of the public. I would expect them to have statistically different personality traits and perceptions even before attaining massive wealth. There is a causal (albeit stochastic) connection between their drives and their outcomes.
Next — even if they were sampled in a representative way — the journey to reaching such a level changes people. Once there*, it affords opportunities of all kinds that are (a) unavailable to the 99.99% and (b) can be hidden or swept under the rug in various ways.
Path dependence matters! Humans are incredibly adaptable for better and worse. From one lens, we can certainly talk about core evolutionary drives, but the way the top 0.01% manifest these drives in their bubble can feel shocking to the rest of us.
* To be clear, I expect most people at that level continue to strive upwards. There is always someone more powerful, at least in some area, to compare oneself against.
Sometimes I revel in the richness of the English language. This at least feels better than wallowing in bewilderment from the cacophony of it.
For example, here are some synonyms for “abstruse” from the Apple dictionary:
obscure, arcane, esoteric, little known, recherché, rarefied, recondite, difficult, hard, puzzling, perplexing, enigmatic, inscrutable, cryptic, Delphic, complex, complicated, involved, over/above one’s head, incomprehensible, unfathomable, impenetrable, mysterious; rare involute, involuted.
As part of an ongoing attempt to communicate my models more openly, here is one very informal “model” about the explosion of language. According to this model, this language proliferation emerges from a combination of these factors:
- Humans live in different places and have different experiences over time, and yet they have much in common, so they naturally re-invent words meaning the same thing.
- Also, even if they already know words that are “good enough”, they get bored and need novelty, so they coin words anew.
- Of course, all the while, many words get blurrier over time, so some people feel the need to create new ones for many reasons: to define identity; to distinguish in- from out-groups; to evade censorship; to prove something (such as knowledge); or simply to seek clarity … at least for a little while before the landscape shifts again.
- It is difficult for people to coordinate on a minimum shared vocabulary. Doing so is subject to politics in all its forms. Culture is a form of coordination but seems mostly agglomerative w.r.t. language. Are there forces powerful enough to stop motivated people from adding to a language?
- Dictionaries are catalogs of usage, after all, not prescriptive, and have no page-length limitations.
- There is a cost, of course, to having duplicative words, but I suspect this is mostly an economic externality and provides scant deterrence for those who like to coin new words.
- Every once in a while a John Wilkins, Peter Mark Roget, or Douglas Lenat comes along and strives to systematize knowledge. They probably appreciate the audacity of such efforts, and blaze ahead anyway.
Piccione and Rubinstein (2007) have developed a ‘jungle model’. In contrast to standard general equilibrium models, the jungle model endows each agent with power. If someone has greater power, they can simply take stuff from those with less power. The resulting equilibrium has some nice properties, including that it exists, and that it is also Pareto efficient. — AI and the paperclip problem by Joshua Gans, 2018
2026-03-11 update: I added the source of the quotation above. The source PDF is freely available at Equilibrium in the Jungle by Michele Piccione and Ariel Rubinstein.
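To make the structure concrete, here is a minimal sketch of a jungle-style allocation under strong simplifying assumptions (one indivisible good per agent, strict preference lists, and power as a simple ranking; this is my illustration, not Piccione and Rubinstein’s full model):

```python
def jungle_allocation(power, preferences, goods):
    """Agents in decreasing order of power take their most-preferred
    remaining good; no weaker agent can take it back."""
    remaining = set(goods)
    allocation = {}
    for agent in sorted(power, key=power.get, reverse=True):
        for good in preferences[agent]:
            if good in remaining:
                allocation[agent] = good
                remaining.discard(good)
                break
    return allocation

power = {"A": 3, "B": 2, "C": 1}
preferences = {
    "A": ["gold", "silver", "bronze"],
    "B": ["gold", "bronze", "silver"],
    "C": ["silver", "gold", "bronze"],
}
alloc = jungle_allocation(power, preferences, ["gold", "silver", "bronze"])
# A (strongest) takes gold; B's first choice is gone, so B takes bronze;
# C still gets its first choice, silver.
```

The greedy, power-ordered allocation is Pareto efficient for the same reason serial dictatorship is: any reallocation that improves a weaker agent’s outcome must take a good from a stronger agent, who by construction already chose it over the alternatives.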
Here is one easy way to improve everyday Bayesian reasoning: use natural frequencies instead of probabilities. Consider two ways of communicating a situation:
Probability format
1% of women have breast cancer
If a woman has breast cancer, there’s a 70% chance the mammogram is positive
If a woman does not have breast cancer, there’s a 10% chance the mammogram is positive.
Natural frequency format
Out of 1,000 women, 10 have breast cancer
Of those 10 who have cancer, 7 test positive
Of the 990 without cancer, 99 test positive
For each of the two formats above, ask this question to a group of people: “A woman tests positive. What is the probability she has cancer?”. Which do you think gives better results?
References
Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: frequency formats. Psychological review, 102(4), 684.
Hoffrage, U., Lindsey, S., Hertwig, R., & Gigerenzer, G. (2000). Communicating statistical information. Science, 290(5500), 2261-2262.
Hoffrage, U., Gigerenzer, G., Krauss, S., & Martignon, L. (2002). Representation facilitates reasoning: What natural frequencies are and what they are not. Cognition, 84(3), 343-352.
Communication note: writing “EG” instead of “e.g.” feels unnecessarily confusing to me. In this context, EG probably should be reserved for Edmund Gettier:
Gettier problems or cases are named in honor of the American philosopher Edmund Gettier, who discovered them in 1963. They function as challenges to the philosophical tradition of defining knowledge of a proposition as justified true belief in that proposition. The problems are actual or possible situations in which someone has a belief that is both true and well supported by evidence, yet which — according to almost all epistemologists — fails to be knowledge. Gettier’s original article had a dramatic impact, as epistemologists began trying to ascertain afresh what knowledge is, with almost all agreeing that Gettier had refuted the traditional definition of knowledge. – https://iep.utm.edu/gettier/
@Hastings … I don’t think I made a comment in this thread—and I don’t see one when I look. I wonder if you are replying to a different one? Link it if you find it?
This diagram from page 4 of “Data Poisoning the Zeitgeist: The AI Consciousness Discourse as a pathway to Legal Catastrophe” conveys the core argument quite well:
I’m about to start reading “Fifty Years of Research on Self-Replication” (1998) by Moshe Sipper. I have a hunch that the history and interconnections therein might be under-appreciated in the field of AI safety. I look forward to diving in.
A quick disclosure of some of my pre-existing biases: I also have a desire to arm myself against the overreaching claims and self-importance of Stephen Wolfram. A friend of mine was “taken in” by Wolfram’s debate with Yudkowsky… and it rather sickened me to see Wolfram exerting persuasive power. At the same time, certain of Wolfram’s rules are indeed interesting, so I want to acknowledge his contributions fairly.
Sorry for the confusion. :P … I do appreciate the feedback. Edited to say: “I’m noticing evidence that many of us may have an inaccurate view of the 1983 Soviet nuclear false alarm. I say this after reading...”
I have also seen conversations get derailed based on such disagreements.
I expect to largely adopt this terminology going forward
May I ask to which audience(s) you think this terminology will be helpful? And what particular phrasing(s) do you plan on trying out?
The quote above from Chalmers is dense and rather esoteric, so I would hesitate to use its particular terminology for most people (the ones likely to get derailed as discussed above). Instead, I would seek out simpler language. As a first draft, perhaps I would say:
Let’s put aside whether LLMs think on the inside. Let’s focus on what we observe—are these observations consistent with the word “thinking”?
Parties aren’t real, the power must be in specific humans or incentive systems.
I would caution against saying “parties aren’t real” for at least two reasons. First, it more-or-less invites definitional wars, which are rarely productive. Second, when we think about explanatory and predictive theories, whether something is “real” (however you define it) is often irrelevant. What matters more is whether the concept is sufficiently clear / standardized / “objective” to measure something and thus serve as a replicable part of a theory.
Humans have long been interested in making sense of power through various theories. One approach is to reduce it to purely individual decisions. Another approach involves attributing power to groups of people or even culture. Models serve many purposes, so I try to ground these kinds of discussions with questions such as:
Are you trying to predict a person’s best next action?
Are you trying to design a political or institutional system such that individual power can manifest in productive ways?
Are you trying to measure the effectiveness of a given politician in a given system, keeping in mind the practical and realistic limits of their agency?
These are very different questions, leading to very different models.
I’m noticing evidence that many of us may have an inaccurate view of the 1983 Soviet nuclear false alarm. I say this after reading “Did Stanislav Petrov save the world in 1983? It’s complicated”. The article is worth reading; it is a clear and detailed ~1100 words. I’ve included some excerpts here:
[...] I must say right away that there is absolutely no reason to doubt Petrov’s account of the events. Also, there is no doubt that Stanislav Petrov did the right thing when he reported up the chain of command that in his assessment the alarm was false. That was a good call in stressful circumstances and Petrov fully deserves the praise for making it.
But did he literally avert a nuclear war and saved the world? In my view, that was not quite what happened. (I would note that as far as I know Petrov himself never claimed that he did.)
To begin with, one assumption that is absolutely critical for the “saved the world” version of the events is that the Soviet Union maintained the so-called launch-on-warning posture. This would mean that it was prepared to launch its missiles as soon as its early-warning system detects a missile attack. This is how US system works and, as it usually happens, most people automatically assume that everybody does the same. Or at least tries to. This assumption is, of course, wrong. The Soviet Union structured its strategic forces to absorb a nuclear attack and focused on assuring retaliation—the posture known as “deep second strike” (“ответный удар”). The idea was that some missiles (and submarines) will survive an attack and will be launched in retaliation once it is over.
[...] the Soviet Union would have waited for actual nuclear detonations on its soil. Nobody would have launched anything based on an alarm generated by the early-warning system, let alone by only one of its segments—the satellites.
[...] It is certain that the alarm would have been recognized as false at some stages. But even if it wasn’t, the most radical thing the General Staff (with the involvement of the political leadership) would do was to issue a preliminary command. No missiles would be launched unless the system detected actual nuclear detonations on the Soviet territory.
Having said that, what Stanislav Petrov did was indeed commendable. The algorithm that generated the alarm got it wrong. The designers of the early-warning satellites took particular pride in the fact that the assessment is done by computers rather than by humans. So, it definitely took courage to make that call to the command center up the chain of command and insist that the alarm is false. We simply don’t know what would have happened if he kept silence or confirmed the alarm as positive. And, importantly, he did not know it either. He just did all he could to prevent the worst from happening.
I haven’t made a thorough assessment myself. For now, I’m adding The Dead Hand to my reading list.
Edited on 2025-12-25 to improve clarity of the first point. Thanks to readers for the feedback.
I am seeking definitions of key foundational concepts in this paper (cognitive pattern, context, influence, selection, motivations) with (a) something as close to formal precision as possible while (b) attempting a minimal word count. This might be asking a lot, but I think it can be done, and I think it is important. I suggest using a very basic foundation: the basic terminology of artificial neural networks (ANNs): neurons, weights, activations, etc. Let the difficulty arise in putting the ideas together, not in confusion about the definitions themselves. If there is ambiguity or variation in how these terms apply, I think it would make sense to lock in some particulars so the definitions can be tightened up. (Walk before you run.) Even better if these definitions themselves are diagrammed and connected visually (perhaps with something like an ontology diagram).
I’d appreciate any efforts in this direction, thanks! I’ve started a draft myself, but I want to have some properly uninterrupted time to iterate on it before sharing.
Why do I ask this? Personally, I find this article hard to parse due to definitional reasons.
I want to be clear: Lots of terrifyingly smart people made this mistake, including some of the smartest scientists who ever lived. Many of them made this mistake for a decade or more before wising up or giving up.
Imagine this. Imagine a future world where gradient-driven optimization never achieves aligned AI. But there is success of a different kind. At great cost, ASI arrives. Humanity ends. In his few remaining days, a scholar with the pen name of Rete reflects back on the 80s approach (i.e. using deterministic rules and explicit knowledge) with the words: “The technology wasn’t there yet; it didn’t work commercially. But they were onto something—at the very least, their approach was probably compatible with provably safe intelligence. Under other circumstances, perhaps it would have played a more influential role in promoting human thriving.”
Technical note: I recommend open source tools like restic or the Rust version rustic for backups. I remember that restic is innovative and better than what came before, but I had not retained why. So I asked Claude for a summary, and it generated this, which jogged and enhanced my memory (which feels like one sweet spot for LLM-assisted thinking, in my opinion):
<div class="llm-content-block" data-model-name="Claude Opus 4.6">
<div class="llm-content-block-content">
Traditional backup tools deduplicated at the file or fixed-block level, so any insertion shifted all subsequent blocks and broke dedup. Restic uses content-defined chunking: a rolling hash (Rabin fingerprint) sets chunk boundaries based on content, not position. Edits only invalidate nearby chunks.
Beyond this, restic eliminates the full/differential/incremental distinction — every snapshot is logically complete, no restore chains. Encryption is mandatory, not optional. Multiple storage backends (S3, SFTP, local) are first-class. Borg had comparable dedup earlier but assumed Python and SSH; restic shipped as a static Go binary targeting cloud storage.
</div>
</div>
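The content-defined chunking idea can be illustrated with a toy sketch. This is my own simplification, not restic’s actual algorithm: I use a byte-sum over a small window in place of a real rolling Rabin fingerprint, and arbitrary window/mask parameters.

```python
import random

def chunk(data: bytes, window: int = 4, mask: int = 0x0F) -> list:
    """Toy content-defined chunker: declare a boundary whenever a hash of
    the previous `window` bytes has its low bits all zero."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = sum(data[i - window:i])  # stand-in for a rolling Rabin fingerprint
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(2000))
edited = original[:10] + b"X" + original[10:]  # insert one byte near the front

a, b = chunk(original), chunk(edited)
# Boundaries depend on content, not absolute position, so chunking
# resynchronizes shortly after the insertion: the tails match.
assert a[-1] == b[-1]
```

With fixed-block chunking, that one-byte insertion would shift every subsequent block and break deduplication; here it only perturbs the chunks near the edit, and everything after re-aligns and deduplicates.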
Note: I use the Markdown editor. Apparently I didn’t get the LLM markup quite right. I’d appreciate pointers on how to do that. Please share them as comments over on this comment. Once I learn how to do it properly, I will update this comment.