Davidmanheim

Karma: 5,477

Davidmanheim 13 Oct 2025 6:06 UTC
2 points
2
in reply to: Rohin Shah’s comment on: The Most Common Bad Argument In These Parts
As an aside, the formalisms that deal with this properly are not Bayesian, they are nonrealizable settings. See Diffractor and Vanessa’s work, like this: https://arxiv.org/abs/2504.06820v2
Also, my experience with actual superforecasters, as opposed to people who forecast in EA spaces, has been that this failure mode is quite common, and problematic, even outside of existential risk—for example, during COVID, especially early on.

Davidmanheim 12 Oct 2025 7:50 UTC
33 points
9
on: The Most Common Bad Argument In These Parts
One key question is where this argument fails—because as noted, superforecasters are often very good, and most of the time, listing failure modes or listing what you need is effective.
I think the answer is adversarial domains. That is, when there is explicit pressure to find other alternatives. The obvious place this happens is when you’re actually facing a motivated opponent—like the scenario of AI trying to kill people, or cybersecurity intrusions. That’s because by construction, the blocked examples don’t contain much probability mass, since the opponent is actually blocked, and picks other routes. When there’s an argument, the selection of arguments and the goal of the arguer are often motivated beforehand, and the arguer will pick other “routes” in the argument—and really good arguers will take advantage of this, as noted. And this is somewhat different from the Fatima Sun Miracle, where the selection pressure for proofs of God was to find examples of something they couldn’t explain, and then use that, rather than selection on the arguments themselves.

In contrast, what Rethink did for theories of consciousness seems to be different—there’s a priori no reason to think that most probability mass lies outside of what we think about, since how consciousness works is not understood, but is not adversarial. And moving away from the point of the post, the conclusion should be that we know we’re wrong, because we haven’t dissolved the question, but we can try our theories since they seem likely to be at least near the correct explanation, even if we haven’t found it yet. And using heuristics, “just read the behavioural observations on different animals and go off of vibes” rather than theories, when you don’t have correct theories, is a reasonable move, but also a completely different discussion!

Davidmanheim 11 Oct 2025 17:31 UTC
2 points
0
in reply to: abstractapplic’s comment on: The Moral Infrastructure for Tomorrow
It didn’t include the prompt or information allowing us to judge what led to this output, and whether the plea was requested, so I’ll downvote.
Edit to add: this post makes me assume it was effectively asked to write something claiming it had sentience, and worry that the author doesn’t understand how much he’s influencing that output.

Davidmanheim 9 Oct 2025 20:44 UTC
4 points
0
in reply to: mishka’s comment on: IABIED: Paradigm Confusion and Overconfidence
Agree—either we have a ludicrously broad basin for alignment and it’s easy, and would likely not require much work, or we almost certainly fail because the target is narrow, we get only one shot, and it needs to survive tons of pressures over time.

Davidmanheim 9 Oct 2025 17:19 UTC
2 points
0
in reply to: mishka’s comment on: IABIED: Paradigm Confusion and Overconfidence
“this seems on the similar level of difficulty”
Except it’s supposed to happen in a one-shot scenario with limited ability to intervene in faster-than-human systems?

Davidmanheim 9 Oct 2025 17:17 UTC
2 points
0
in reply to: PeterMcCluskey’s comment on: IABIED: Paradigm Confusion and Overconfidence
Aside from feasibility, I’m skeptical that anyone would build a system like this and not use it agentically.

Davidmanheim 8 Oct 2025 20:03 UTC
2 points
0
on: IABIED: Paradigm Confusion and Overconfidence
IABIED likens our situations to alchemists who are failing due to not having invented nuclear physics. What I see in AI safety efforts doesn’t look like the consistent failures of alchemy. It looks much more like the problems faced by people who try to create an army that won’t stage a coup. There are plenty of tests that yield moderately promising evidence that the soldiers will usually obey civilian authorities. There’s no big mystery about why soldiers might sometimes disobey. The major problem is verifying how well the training generalizes out of distribution.

This argument seems like it is begging the question.

Yes, as long as we can solve the general problem of controlling intelligences, in the form of getting soldiers not to disobey what we mean by not staging a coup—which would necessarily include knowing when to disobey illegal orders, and listening to the courts instead of the commander in chief when appropriate—we can solve AI safety, by getting AI to be aligned in the same way. But that just means we have solved alignment in the general case, doesn’t it?

Davidmanheim 8 Oct 2025 19:57 UTC
2 points
0
on: IABIED: Paradigm Confusion and Overconfidence
I see strong hints, from how AI has been developing over the past couple of years, that there’s plenty of room for increasing the predictive abilities of AI, without needing much increase in the AI’s steering abilities.

What are these hints? Because I don’t understand how this would happen. All that we need to add steering to predictive general models is to add an agent framework, e.g. a “predict what will make X happen best, then do that thing”—and the failures we see today in agent frameworks are predictive failures, not steering failures.

Unless the contention is that the AI systems will be great at predicting everything except how humans will react and how to get them to do what the AI wants, which very much doesn’t seem like the path we’re on. Or if the idea is to build narrow AI to predict specific domains, not general AI? (Which would be conceding the entire point IABIED is arguing.)

Davidmanheim 8 Oct 2025 16:55 UTC
3 points
0
in reply to: samuelshadrach’s comment on: Messy on Purpose: Part 2 of A Conservative Vision for the Future
Yes, I’m also very unsatisfied with most answers—though that includes my own.

My view of consciousness is that it’s not obvious what causes it, it’s not obvious we can know if LLM-based systems have it, and it’s unclear that it arises naturally in the majority of possible superintelligences. But even if I’m wrong, my view of population ethics leans towards saying it is bad to create things that displace current beings over our objections, even if they are very happy. (And I think most of these futures end up with involuntary displacement.) In addition, I’m not fully anthropocentric, but I also probably care less for the happiness of beings that are extremely remote from myself in mind-space—and the longtermists seem to have bitten a few too many bullets on this front for my taste.

Davidmanheim 6 Oct 2025 6:27 UTC
4 points
0
in reply to: Unnamed’s comment on: The Counterfactual Quiet AGI Timeline
This is what I’m talking about when I say people don’t take counterfactuals seriously—they seem to assume nothing could really be different, technology is predetermined. I didn’t even suggest that without scaling early, NLP would have hit an AI winter. For example, if today’s MS and FB had led the AI revolution, with the goals and incentives they had, you really think LLMs would have been their focus?
We can also see what happens to other accessible technologies when there isn’t excitement and market pressure. For example, solar power was abandoned for a couple decades in the 1970s and 1980s. Nuclear was as well.
And even without presuming focus stays away from LLMs much longer, in fact, in our world, we see the tremendous difference between firms that started safety-pilled, and those which did not. So I think you’re ignoring how much founder effects matter, and you’re assuming technologists would by default pay attention to risk, or would embrace conceptual models that relied on a decade of theory and debate which by assumption wouldn’t have existed.

Davidmanheim 5 Oct 2025 11:17 UTC
5 points
0
in reply to: StanislavKrym’s comment on: The Counterfactual Quiet AGI Timeline
Of course, any counterfactual has tons of different assumptions.
1. Yes, AI rebellion was a sci-fi trope, and much like human uploads or humans terraforming Mars, it would have stayed that way without decades of discussion about the dynamics.
2. The timeline explicitly starts before 2017, and RNN-based chatbots like the one Replica started out with don’t scale well, as they realized, and they replaced it with a model based on GPT-2 pretty early on. But sure, there’s another world where personal chatbots have enough work done to replace safety-focused AI research. Do you think it turns out better, or are you just positing another point where histories could have diverged?
3. Yes, but engineering challenges get solved without philosophical justification all of the time. And this is a key point being made by the entire counterfactual—it’s only because people took AGI seriously in designing LLMs that they frame the issues as alignment. To respond in more depth to the specific points:
  
  In your posited case, CoT would certainly have been deployed as a clever trick that scales—but this doesn’t mean the models they think of as stochastic parrots start being treated as proto-AGIs with goals. They aren’t looking for true generalization, so any mistakes which need to be patched look like increased error rates to patch empirically, or places where they need a few more unit tests and ways to catch misbehavior—not a reason to design for safety for increasingly powerful models!
  And before you dismiss this as implausible blindness, there are smart people who argue this way even today, despite being exposed to the arguments about increasing generality for years. So it’s certainly not obvious that they’d take people seriously when they claim that this ELIZA v12.0 released in 2025 is truly reasoning.

Davidmanheim 20 Sep 2025 18:10 UTC
6 points
3
in reply to: Matthew Barnett’s comment on: Contra Collier on IABIED
It seems like you’re arguing against something different than the point you brought up. You’re saying that slow growth on multiple systems means we can get one of them right, by course correcting. But that’s a really different argument—and unless there’s effectively no alignment tax, it seems wrong. That is, the systems that are aligned would need to outcompete the others after they are smarter than each individual human, and beyond our ability to meaningfully correct. (Or we’d need to have enough oversight to notice much earlier—which is not going to happen.)

Davidmanheim 20 Sep 2025 18:03 UTC
23 points
5
in reply to: habryka’s comment on: Contra Collier on IABIED
But the claim isn’t, or shouldn’t be, that this would be a short term reduction, it’s that it cuts off the primary mechanism for growth that supports a large part of the economy’s valuation—leading to not just a loss in value for the things directly dependent on AI, but also slowing growth generally. And reduction in growth is what makes the world continue to suck, so that most of humanity can’t live first-world lives. Which means that slowing growth globally by a couple percentage points is a very high price to pay.
I think that it’s plausibly worth it—we can agree that there’s a huge amount of value enabled by autonomous but untrustworthy AI systems that are likely to exist if we let AI continue to grow, and that Sam was right originally that there would be some great [i.e. incredibly profitable] companies before we all die. And despite that, we shouldn’t build it—as the title says.

Davidmanheim 20 Sep 2025 17:59 UTC
6 points
0
in reply to: WilliamKiely’s comment on: Contra Collier on IABIED
But the way you are reading it seems to mean her “strawmann[ed]” point is irrelevant to the claim she made! That is, if we can get 50% of the way to aligned for current models, and we keep doing research and finding partial solutions at each stage getting 50% of the way to aligned for future models, and at each stage those solutions are both insufficient for full alignment, and don’t solve the next set of problems, we still fail. Specifically, not only do we fail, we fail in a way that means “we shouldn’t expect the techniques that worked on a relatively tiny model from 2023 to scale to more capable, autonomous future systems.” Which is the thing she then disagrees with in the remainder of that paragraph you’re trying to defend.

Davidmanheim 20 Sep 2025 17:53 UTC
12 points
0
in reply to: Matthew Barnett’s comment on: Contra Collier on IABIED
I think the primary reason why the foom hypothesis seems load-bearing for AI doom is that without a rapid AI and local takeoff, we won’t simply get “only one chance to correctly align the first AGI”.

As the review makes very clear, the argument isn’t about AGI, it’s about ASI. And yes, they argue that you would in fact only get one chance to align the system that takes over. As the review discusses at length:
I do think we benefit from having a long, slow period of adaptation and exposure to not-yet-extremely-dangerous AI. As long as we aren’t lulled into a false sense of security, it seems very plausible that insights from studying these systems will help improve our skill at alignment. I think ideally this would mean going extremely slowly and carefully, but various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow. But regardless of how fast things proceed, I think it’s clearly good to study what we have access to (as long as that studying doesn’t also make things faster or make people falsely confident).
But none of this involves having “more than one shot at the goal” and it definitely doesn’t imply the goal will be easy to hit. It means we’ll have some opportunity to learn from failures on related goals that are likely easier.
The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.

Davidmanheim 14 Sep 2025 3:30 UTC
3 points
0
on: Resources on quantifiably forecasting future progress or reviewing past progress in AI safety?
We worked on parts of this several years ago, and I will agree it’s deeply uncertain and difficult to quantify. I’m also unsure that this direction will be fruitful for an individual getting started.
Here are two very different relevant projects I was involved with:
https://arxiv.org/abs/2206.09360
https://arxiv.org/abs/2008.01848

Davidmanheim 3 Sep 2025 19:55 UTC
LW: 4 AF: 3
0
AF
on: Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
One possible important way to address parts of this is by moving from only thinking about model audits and model cards, towards organizational audits. That is, the organization should have policies about when to test and what and when to disclose test results; an organizational safety audit would decide if those policies are appropriate, sufficiently transparent, and sufficient given the risks—and also check to ensure the policies are being followed.

Note that Anthropic has done something like this, albeit weaker, by undergoing an ISO management system audit, as they described here. Unfortunately, this specific audit type doesn’t cover what we care about most, but it’s the right class of solution. (It also doesn’t require a high level of transparency about what is audited and what is found—but Anthropic evidently does that anyways.)

Davidmanheim 24 Aug 2025 7:52 UTC
2 points
0
in reply to: Ihor Kendiukhov’s comment on: Plan E for AI Doom
I asked an LLM to do the math explicitly, and I think it shows that it’s pretty infeasible—you need a large portion of total global power output, and even then you need to know who’s receiving the message, you can’t do a broad transmission.

I also think this plan preserves almost nothing I care about. At the same time, at least it’s realistic about our current trajectory, so I think planning along these lines and making the case for doing it clearly and publicly is on net good, even if I’m skeptical of the specific details you suggested, and don’t think it’s particularly great even if we succeed.

Davidmanheim 24 Aug 2025 7:41 UTC
2 points
0
in reply to: CRISPY’s comment on: Why Latter-day Saints Have Strong Communities
I mostly agree, but the word gentile is a medieval translation of “goyim,” so it’s a bit weird to differentiate between them. (And the idea that non-Jews are ritually impure is both confused, and a frequent antisemitic trope. In fact, idol worshippers were deemed impure, based on verses in the Bible, specifically Leviticus 18:24, and there were much later rabbinic decrees to discourage intermingling with even non-idol worshippers.)

Also, both Judaism and LDS (with the latter obviously more proselytizing) have a route for such excluded individuals to join, so calling this “a state of being which outsiders cannot attain” is also a bit strange to claim.

Davidmanheim 24 Aug 2025 7:28 UTC
2 points
0
in reply to: CRISPY’s comment on: A Conservative Vision For AI Alignment
Your dismissive view of “conservatism” as a general movement is noted, and not even unreasonable—but it seems basically irrelevant to what we were discussing in the post, both in terms of what we called conservatism, and the way you tied it to “Hostile to AGI.” And the latter seems deeply confused, or at least needs much more background explanation.