My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, and machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.
David James
What people concerned about AI-coding …
… said ~2 years ago:
AI coders will find sneaky ways to trick humans
… say today:
Humans won’t even be paying attention
This is obviously and intentionally exaggerated to make a point. Still, I don’t want this take to simply end there. I want to use the snark not as a conversation ender but as a starter. So… What are some better ways forward? Generally speaking, I don’t think “better awareness” or “more self-discipline” are winning strategies.
I’m relatively less interested in a competitive framing between OpenAI and Anthropic to see, e.g., “who played it better”. First, that framing suggests there was just one game being played. It seems more accurate to view it as a progression of different games.
To a first approximation, my guess is by the time this popped into the public spotlight, the die was largely cast (so to speak). It was, more or less, a strategy by Hegseth to put Anthropic in an impossible bind.
Second, that kind of framing feels too much like so many news stories I read that try to fasten sports metaphors onto real world events to make juicy narratives. This isn’t a very good “reason” I admit, but it sort of explains why my alarm bells started ringing on that frame.
Personally, I first want to learn about what happened and when. After that, maybe I would try to analyze and learn lessons.
I take your points individually, but I don’t synthesize them in the way I think you might.
To start, the top 0.01% wealthiest people are far from a representative sample from the public. I would expect them to have statistically different personality traits and perceptions even before attaining massive wealth. There is a causal (albeit stochastic) connection between their drives and their outcomes.
Next — even if they were sampled in a representative way — the journey to reaching such a level changes people. Once there*, it affords opportunities of all kinds that are (a) unavailable to the 99.99% and (b) can be hidden or swept under the rug in various ways.
Path dependence matters! Humans are incredibly adaptable for better and worse. From one lens, we can certainly talk about core evolutionary drives, but the way the top 0.01% manifest these drives in their bubble can feel shocking to the rest of us.
* To be clear, I expect most people at that level continue to strive upwards. There is always someone more powerful, at least in some area, to compare oneself against.
Sometimes I revel in the richness of the English language. This at least feels better than wallowing in bewilderment from the cacophony of it.
For example, here are some synonyms for “abstruse” from the Apple dictionary:
obscure, arcane, esoteric, little known, recherché, rarefied, recondite, difficult, hard, puzzling, perplexing, enigmatic, inscrutable, cryptic, Delphic, complex, complicated, involved, over/above one’s head, incomprehensible, unfathomable, impenetrable, mysterious; rare involute, involuted.
As part of an ongoing attempt to communicate my models more openly, here is one very informal “model” about the explosion of language. According to this model, this language proliferation emerges from a combination of these factors:
- Humans live in different places and have different experiences over time, and yet they have much in common, so they naturally re-invent words meaning the same thing.
- Also, even if they already know words that are “good enough”, they get bored and need novelty, so they coin words anew.
- Of course, all the while, many words get blurrier over time, so some people feel the need to create new ones for many reasons: to define identity; to distinguish in- from out-groups; to evade censorship; to prove something (such as knowledge); or simply to seek clarity … at least for a little while before the landscape shifts again.
- It is difficult for people to coordinate on a minimum shared vocabulary. Doing so is subject to politics in all its forms. Culture is a form of coordination but seems mostly agglomerative w.r.t. language. Are there forces powerful enough to stop motivated people from adding to a language?
- Dictionaries are catalogs of usage, after all, not prescriptive, and have no page-length limitations.
- There is a cost, of course, to having duplicative words, but I suspect this is mostly an economic externality and provides scant deterrence for those who like to coin new words.
Every once in a while a John Wilkins, Peter Mark Roget, or Douglas Lenat comes along and strives to systematize knowledge. They probably appreciate the audacity of such efforts, and blaze ahead anyway.
Piccione and Rubinstein (2007) have developed a ‘jungle model’. In contrast to standard general equilibrium models, the jungle model endows each agent with power. If someone has greater power, they can simply take stuff from those with less power. The resulting equilibrium has some nice properties, including that it exists, and that it is also Pareto efficient. — AI and the paperclip problem by Joshua Gans, 2018
2026-03-11 update: I added the source of the quotation above. The source PDF is freely available at Equilibrium in the Jungle by Michele Piccione and Ariel Rubinstein.
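To make the jungle model concrete, here is a minimal Python sketch of my own (the function name and the single-good simplification are mine, not the paper’s). With a fixed power ranking and indivisible goods, the equilibrium allocation works like a serial dictatorship: agents choose, in order of power, their most-preferred remaining good. The resulting allocation is Pareto efficient in this toy setting.

```python
# Toy "jungle" allocation: stronger agents can take goods from weaker ones,
# so in equilibrium each agent (in order of power) ends up with the best
# good not already claimed by someone stronger.

def jungle_allocation(power_order, preferences, goods):
    """power_order: agents from strongest to weakest.
    preferences: agent -> list of goods, best first.
    Returns agent -> chosen good (or None if nothing acceptable remains)."""
    remaining = set(goods)
    allocation = {}
    for agent in power_order:
        choice = next((g for g in preferences[agent] if g in remaining), None)
        allocation[agent] = choice
        if choice is not None:
            remaining.discard(choice)
    return allocation

prefs = {
    "A": ["land", "water", "tools"],
    "B": ["land", "tools", "water"],
    "C": ["water", "land", "tools"],
}
# B is strongest, so B takes land even though A also wants it most.
print(jungle_allocation(["B", "A", "C"], prefs, ["land", "water", "tools"]))
# → {'B': 'land', 'A': 'water', 'C': 'tools'}
```

This is only a cartoon of the paper’s general-equilibrium setup, but it captures the key feature: power, not prices, determines who gets what.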
Here is one easy way to improve everyday Bayesian reasoning: use natural frequencies instead of probabilities. Consider two ways of communicating a situation:
Probability format
1% of women have breast cancer
If a woman has breast cancer, there’s a 70% chance the mammogram is positive
If a woman does not have breast cancer, there’s a 10% chance the mammogram is positive.
Natural frequency format
Out of 1,000 women, 10 have breast cancer
Of those 10 who have cancer, 7 test positive
Of the 990 without cancer, 99 test positive
For each of the two formats above, ask this question to a group of people: “A woman tests positive. What is the probability she has cancer?”. Which do you think gives better results?
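The two formats describe the same situation, so Bayes’ theorem gives the same answer either way. Here is a minimal Python sketch (the function name is mine) computing it both ways from the numbers above:

```python
# P(cancer | positive test) from the numbers in the text:
# 1% prevalence, 70% sensitivity, 10% false-positive rate.

def posterior_probability(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Probability format: plug the rates into the formula.
p = posterior_probability(0.01, 0.70, 0.10)

# Natural frequency format: count cases out of 1,000 women.
p_freq = 7 / (7 + 99)  # 7 true positives, 99 false positives

print(round(p, 4))       # → 0.066
print(round(p_freq, 4))  # → 0.066
```

Both routes land on roughly 6.6%, which many people find surprisingly low; the natural-frequency framing makes the counting step explicit instead of hiding it inside the formula.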
References
Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: frequency formats. Psychological review, 102(4), 684.
Hoffrage, U., Lindsey, S., Hertwig, R., & Gigerenzer, G. (2000). Communicating statistical information. Science, 290(5500), 2261-2262.
Hoffrage, U., Gigerenzer, G., Krauss, S., & Martignon, L. (2002). Representation facilitates reasoning: What natural frequencies are and what they are not. Cognition, 84(3), 343-352.
Communication note: writing EG instead of e.g. feels unnecessarily confusing to me. In this context, EG probably should be reserved for Edmund Gettier:
Gettier problems or cases are named in honor of the American philosopher Edmund Gettier, who discovered them in 1963. They function as challenges to the philosophical tradition of defining knowledge of a proposition as justified true belief in that proposition. The problems are actual or possible situations in which someone has a belief that is both true and well supported by evidence, yet which — according to almost all epistemologists — fails to be knowledge. Gettier’s original article had a dramatic impact, as epistemologists began trying to ascertain afresh what knowledge is, with almost all agreeing that Gettier had refuted the traditional definition of knowledge. – https://iep.utm.edu/gettier/
@Hastings … I don’t think I made a comment in this thread—and I don’t see one when I look. I wonder if you are replying to a different one? Link it if you find it?
This diagram from page 4 of “Data Poisoning the Zeitgeist: The AI Consciousness Discourse as a pathway to Legal Catastrophe” conveys the core argument quite well:
I’m about to start reading “Fifty Years of Research on Self-Replication” (1998) by Moshe Sipper. I have a hunch that the history and interconnections therein might be under-appreciated in the field of AI safety. I look forward to diving in.
A quick disclosure of some of my pre-existing biases: I also have a desire to arm myself against the overreaching claims and self-importance of Stephen Wolfram. A friend of mine was “taken in” by Wolfram’s debate with Yudkowsky… and it rather sickened me to see Wolfram exerting persuasive power. At the same time, certain of Wolfram’s rules are indeed interesting, so I want to acknowledge his contributions fairly.
Sorry for the confusion. :P … I do appreciate the feedback. Edited to say: “I’m noticing evidence that many of us may have an inaccurate view of the 1983 Soviet nuclear false alarm. I say this after reading...”
I have also seen conversations get derailed based on such disagreements.
I expect to largely adopt this terminology going forward
May I ask to which audience(s) you think this terminology will be helpful? And what particular phrasing(s) do you plan on trying out?
The quote above from Chalmers is dense and rather esoteric; so I would hesitate to use its particular terminology for most people (the ones likely to get derailed as discussed above). Instead, I would seek out simpler language. As a first draft, perhaps I would say:
Let’s put aside whether LLMs think on the inside. Let’s focus on what we observe—are these observations consistent with the word “thinking”?
Parties aren’t real, the power must be in specific humans or incentive systems.
I would caution against saying “parties aren’t real” for at least two reasons. First, it more-or-less invites definitional wars, which are rarely productive. Second, when we think about explanatory and predictive theories, whether something is “real” (however you define it) is often irrelevant. What matters more is whether the concept is sufficiently clear, standardized, and “objective” to measure something and thus serve as a replicable part of a theory.
Humans have long been interested in making sense of power through various theories. One approach is to reduce it to purely individual decisions. Another approach involves attributing power to groups of people or even culture. Models serve many purposes, so I try to ground these kinds of discussions with questions such as:
Are you trying to predict a person’s best next action?
Are you trying to design a political or institutional system such that individual power can manifest in productive ways?
Are you trying to measure the effectiveness of a given politician in a given system, keeping in mind the practical and realistic limits of their agency?
These are very different questions, leading to very different models.
I’m noticing evidence that many of us may have an inaccurate view of the 1983 Soviet nuclear false alarm. I say this after reading “Did Stanislav Petrov save the world in 1983? It’s complicated”. The article is worth reading; it is a clear and detailed ~1100 words. I’ve included some excerpts here:
[...] I must say right away that there is absolutely no reason to doubt Petrov’s account of the events. Also, there is no doubt that Stanislav Petrov did the right thing when he reported up the chain of command that in his assessment the alarm was false. That was a good call in stressful circumstances and Petrov fully deserves the praise for making it.
But did he literally avert a nuclear war and saved the world? In my view, that was not quite what happened. (I would note that as far as I know Petrov himself never claimed that he did.)
To begin with, one assumption that is absolutely critical for the “saved the world” version of the events is that the Soviet Union maintained the so-called launch-on-warning posture. This would mean that it was prepared to launch its missiles as soon as its early-warning system detects a missile attack. This is how US system works and, as it usually happens, most people automatically assume that everybody does the same. Or at least tries to. This assumption is, of course, wrong. The Soviet Union structured its strategic forces to absorb a nuclear attack and focused on assuring retaliation—the posture known as “deep second strike” (“ответный удар”). The idea was that some missiles (and submarines) will survive an attack and will be launched in retaliation once it is over.
[...] the Soviet Union would have waited for actual nuclear detonations on its soil. Nobody would have launched anything based on an alarm generated by the early-warning system, let alone by only one of its segments—the satellites.
[...] It is certain that the alarm would have been recognized as false at some stages. But even if it wasn’t, the most radical thing the General Staff (with the involvement of the political leadership) would do was to issue a preliminary command. No missiles would be launched unless the system detected actual nuclear detonations on the Soviet territory.
Having said that, what Stanislav Petrov did was indeed commendable. The algorithm that generated the alarm got it wrong. The designers of the early-warning satellites took particular pride in the fact that the assessment is done by computers rather than by humans. So, it definitely took courage to make that call to the command center up the chain of command and insist that the alarm is false. We simply don’t know what would have happened if he kept silence or confirmed the alarm as positive. And, importantly, he did not know it either. He just did all he could to prevent the worst from happening.
I haven’t made a thorough assessment myself. For now, I’m adding The Dead Hand to my reading list.
Edited on 2025-12-25 to improve clarity of the first point. Thanks to readers for the feedback.
I am seeking definitions of key foundational concepts in this paper (cognitive pattern, context, influence, selection, motivations) with (a) something as close to formal precision as possible while (b) attempting a minimal word count. This might be asking a lot, but I think it can be done, and I think it is important. I suggest using a very basic foundation: the basic terminology of artificial neural networks (ANNs): neurons, weights, activations, etc. Let the difficulty arise in putting the ideas together, not in confusion about the definitions themselves. If there is ambiguity or variation in how these terms apply, I think it would make sense to lock-in some particulars so the definitions can be tightened up. (Walk before you run.) Even better if these definitions themselves are diagrammed and connected visually (perhaps with something like an ontology diagram).
I’d appreciate any efforts in this direction, thanks! I’ve started a draft myself, but I want to have some properly uninterrupted time to iterate on it before sharing.
Why do I ask this? Personally, I find this article hard to parse due to definitional reasons.
I want to be clear: Lots of terrifyingly smart people made this mistake, including some of the smartest scientists who ever lived. Many of them made this mistake for a decade or more before wising up or giving up.
Imagine this. Imagine a future world where gradient-driven optimization never achieves aligned AI. But there is success of a different kind. At great cost, ASI arrives. Humanity ends. In his few remaining days, a scholar with the pen name of Rete reflects back on the 80s approach (i.e. using deterministic rules and explicit knowledge) with the words: “The technology wasn’t there yet; it didn’t work commercially. But they were onto something—at the very least, their approach was probably compatible with provably safe intelligence. Under other circumstances, perhaps it would have played a more influential role in promoting human thriving.”
We might soon build something a lot smarter than us.
Logically speaking, “soon” is not required for your argument. With regard to persuasiveness, saying “soon” is double-edged: those who think soon applies will feel more urgency, but the rest are given an excuse to tune out. A thought experiment by Stuart Russell goes like this: if we know with high confidence a meteor will smack Earth in 50 years, what should we do? This is an easy call. Prepare, starting now.
The right time to worry about a potentially serious problem for humanity depends not just on when the problem will occur but also on how long it will take to prepare and implement a solution. For example, if we were to detect a large asteroid on course to collide with Earth in 2069, would we wait until 2068 to start working on a solution? Far from it! There would be a worldwide emergency project to develop the means to counter the threat, because we can’t say in advance how much time is needed. - Book Review: Human Compatible at Slate Star Codex
In Chapter 6 of his 2019 book, Human Compatible: Artificial Intelligence and the Problem of Control, Russell lists various objections to what one would hope would be well underway as of 2019:
“The implications of introducing a second intelligent species onto Earth are far-reaching enough to deserve hard thinking.” So ended The Economist magazine’s review of Nick Bostrom’s Superintelligence. Most would interpret this as a classic example of British understatement. Surely, you might think, the great minds of today already doing this hard thinking—engaging in serious debate, weighing up the risks and benefits, seeking solutions, ferreting out loopholes in solutions, and so on. Not yet, as far as I am aware.
The 50-year-away meteor is discussed by Human Compatible on page 151 as well as in
Reason 2 of 10 Reasons to Ignore AI Safety by Robert Miles (adapted from H.C. Chapter 6)
Provably Beneficial Artificial Intelligence by Russell (PDF, free)
I’ll share a stylized story inspired by my racing days of many years ago. Imagine you are a competitive amateur road cyclist. After years of consistent training and racing in the rain, wind, heat, and snow, you are ready for your biggest race of the year, a 60-something mile hilly road race. Having completed your warm-up, caffeination, and urination rituals, you line up with your team and look around. Having agreed you are the leader for the day (you are best suited to win given the course and relative fitness), you chat about the course, wind, rivals, feed zones, and so on.
Seventy or so brightly-colored participants surround you, all optimized to convert calories into rotational energy. You feel the camaraderie among this traveling band of weekend warriors. Lots of determination and shaved legs. You notice but don’t worry about the calves carved out of wood since they are relatively weak predictors of victory. Speaking of...
You hazard that one of twenty-some contenders is likely to win. It will involve lots of fitness and some blend of grit, awareness, timing, team support, adaptability, and luck. You estimate you are in the top twenty based on previous events. So from the outside viewpoint, you estimate your chances of winning are low, roughly 5%.
What’s that? … You hear three out-of-state semi-pro mountain bikers are in the house. Teammates too. They have decided to “use this race for training”. Lovely. You update your P(win) to ~3%. Does this 2% drop bother you? It is actually a 40% decrease. For a moment maybe but not for long. What about the absolute probability? Does a 3% chance demotivate you? Hell no. A low chance of winning will not lower your level of effort.
You remember that time trial from the year before where your heart rate was pegged at your threshold for something like forty minutes. The heart-rate monitor reading was higher than you expected, but your body indicated it was doable. At the same time, your exertion level was right on the edge of unsustainable. Saying “it’s all mental” is cliché, but in that case, it was close enough to the truth. So you engaged in some helpful self-talk (a.k.a. repeating of the mantra “I’m not going to crack”) for the last twenty minutes. There was no voodoo nor divine intervention; it was just one way to focus the mind to steer the body in a narrow performance band.
You can do that again, you think. You assume a mental state of “I’m going to win this” as a conviction, a way of enhancing your performance without changing your epistemic understanding.
How are you going to win? You don’t know exactly. You can say this: you will harness your energy and abilities. You review your plan. Remain calm, pay attention, conserve energy until the key moments, trust your team to help, play to your strengths, and when the time is right, take a calculated risk. You have some possible scenarios in mind; get in a small breakaway, cooperate, get ready for cat-and-mouse at the end, maybe open your sprint from 800 meters or farther. (You know from past experiences your chances of winning a pack sprint are very low.)
Are we starting soon? Some people are twitchy. Lots of cycling computer beeping and heart-rate monitor fiddling. Ready to burst some capillaries? Ready to drop the hammer? Turn the pedals in anger? Yes! … and no. You wait some more. This is taking a while. (Is that person really peeing down their leg? Caffeine intake is not an exact science apparently. Better now than when the pistons are in motion.)
After a seemingly endless sequence of moments, a whistle blows. The start! The clicking of shoes engaging with pedals. Leading to … not much … the race starts with a neutral roll out. Slower than anyone wants. But the suspense builds … until maybe a few minutes later … a cacophony of shifting gears … and the first surge begins. This hurts. Getting dropped at the beginning is game-over. Even trading position to save energy is unwise right now—you have to be able to see the front of the pack until things calm down a bit. You give whatever level of energy is needed, right now.
One gets the feeling that the authors are just raising their hands and saying something like: “Look, we are doomed, and there’s no realistic way we’re getting out of this short of doing stuff we are not going to do. These proposals are the necessary consequence of accepting what is stated in the preceding chapters, so that’s that”.
First, personally, I can understand how some people could have this response, and I empathize—this topic is heavy. However, with some time and emotional distance, one can see the above interpretation doesn’t correspond with key passages from the book, such as this one from Chapter 13:
Anytime someone tells you that the Earth could not possibly manage to do anything as difficult as restricting AI research, they are really claiming to know that countries will never care. They are asserting that countries and their leaders could not possibly come to care even 1 percent as much as they cared to fight World War II.
Also, the last chapter (14, “Where There’s Life, There’s Hope”) centers around people facing the truth of their circumstances and rising to the occasion. Here is a key quote:
Humanity averted nuclear war because people who understood that the world was on track for destruction worked hard to change tracks.
Chapter 14 concludes by offering tailored suggestions to civil servants, elected officials, political leaders, journalists, or “the rest of us”. The authors are decidedly not sending up the white flag; instead, they are stating what needs to be done and advocating for it.
Recognizing the importance of choosing and comparing models / concepts might be a prerequisite concept. People learn this in various ways … When it comes to choosing what parameters to include in a model, statisticians compare models in various ways. When the goal is prediction, they care a lot about predictive power; when the goal is statistical inference, they also pay attention to multicollinearity. I see connections between a model’s parameters and an argument’s concepts. First, both have costs and benefits. Second, any particular combination has interactive effects that matter. Third, as a matter of epistemic discipline, it is important to recognize the importance of trying and comparing frames of reference: different models for the statistician and different concepts for an argument.
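To make the multicollinearity point concrete, here is a small Python sketch of my own (the data are made up) computing variance inflation factors with plain NumPy. A predictor that is nearly a copy of another gets a large VIF, signaling that its individual coefficient is hard to interpret even if the model predicts well:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return factors

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)               # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
# x1 and x2 get large VIFs; x3 stays near 1.
```

The analogy to arguments: two near-duplicate concepts can each seem to carry weight while actually sharing one underlying justification.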