Ben Goldhaber

Karma: 1,415

Ben Goldhaber 19 Jan 2026 19:45 UTC
4 points
0
on: Lifelink™: Freedom for your Child
$50,000 FRC legal defense insurance, access to our specialized attorneys, and a cryptographically signed Safety Log that has precedent in state courts as admissible evidence of your child’s continuous supervision via Lifelink
I like this idea of insurance here—it would help with some of my fears—and you could imagine the company building this giving higher defense budgets at the start to help fight court cases that would establish precedent that Lifelink is suitable monitoring.

Ben Goldhaber 16 Jan 2026 17:34 UTC
13 points
0
on: What’s going on at CFAR? (Updates and Fundraiser)
Congratulations to Anna and the team for cohering around a vision and set of experiments. I donated to the new CFAR; I hope you continue posting about what you learn through the upcoming workshops.
One {note? suggestion? “real spirit” discussion point?} - I feel like the framing of aCFAR was missing something important about the state of rationality today. Namely, from 2026 onward, being more rational is unlikely to be a “human technique”-only affair. It will look more like cyborgs and centaurs: humans using AI tools and agents in different configurations and ways to make better decisions.
I won’t belabor how good the AIs have gotten, and instead will just note that they are effective aids for rationalist techniques:
- I wrote a post about backchaining where I had Claude create malleable, customizable timelines. I found this to be a really effective way to “feel” at the S1 level the constraints and targets.
- They’re very good at making Fermi estimates.
- There’s ongoing research and experiments into using them as mediators and for fostering cooperation, à la double crux.
- They’re probably useful for Focusing and internal work too (I know the Jhourney team has been running experiments here, though I haven’t found it that effective personally).
I appreciate that it’s a Center for Applied Rationality, and maybe this particular center doesn’t need to think about the cyborg angle and can just focus on developing better models of “who-ness”. Maybe a different center should!
But it seems valuable to consider, to the extent you want to push forward the frontier of rationality. I suspect there’s some connection between the moments when AI meaningfully aids my real thinking, the moments when I’m doing slop-ful fake thinking where the AI is aiding my delusions, and the concept you’re defining as “who-ness.” Who-ness seems adjacent to taste, which might matter a lot for steering AI fleets towards goodness and meaningful concepts. And it’s probably the case that the general rationality techniques you’re imparting and working on with attendees could be more effective with AI assistance.

Ben Goldhaber 5 Jan 2026 15:25 UTC
2 points
0
on: bgold’s Shortform
In Jack Clark’s Import AI 439, he references a new paper Universally Converging Representations of Matter Across Scientific Foundation Models

> Do AI systems end up finding similar ways to represent the world to themselves? Yes, as they get smarter and more capable, they arrive at a common set of ways of representing the world.
> The latest evidence for this is research from MIT which shows that this is true for scientific models and the modalities they’re trained on: “representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems,” they write. “Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality.”
> As with other studies of representation, they found that as you scale the data and compute models are trained on, “their representations converge further”.

This seems like useful empirical evidence for the Natural Abstraction Hypothesis (I haven’t been following progress on that research agenda, so I don’t know how significant of an update this is)

Ben Goldhaber 2 Dec 2025 14:44 UTC
3 points
0
in reply to: Saul Munn’s comment on: Explosive Skill Acquisition
Thank you for the comment Saul—I agree with a lot of your points, in particular that “explosive” periods are costly and inefficient (relative perhaps to some ideal), and that they are not in and of themselves a solution for long-term retention.
I expect if we have a crux it’s whether someone who intends to follow an incremental path vs someone who does an intense acquisition period is more likely to, ~ a year later, actually have the skill. And my guess is, for a number of reasons, it’s the latter; I’d expect a lot of incrementalists to ‘just not actually do the thing’.
* My ideal strategy would be “explore lightly a number of things, to determine what you want → explode towards that for an intense period of time → establish incremental practices to maintain and improve”
* Your comment also highlighted for me something that I had cut from the initial draft: my belief that explosive periods help overcome emotional blockers, which I think might be a big part of why people shy away from skills they say they want.

Ben Goldhaber 30 Nov 2025 19:25 UTC
2 points
0
in reply to: cousin_it’s comment on: Explosive Skill Acquisition
No worries, I appreciate the perspective. I agree that for many skills there is a consolidation and rest period that is needed. An obvious example is that you can’t cram all of the effort needed to build muscle into one week and expect the same kinds of returns that you would get over many months. Though, I do expect you could master the biomechanical skills of weightlifting much faster with that attitude!
If you have examples of the multidimensional learning schedule, I’d love to hear them. I’m imagining something like {30 minutes of Spanish language shows}?

Ben Goldhaber 30 Nov 2025 18:56 UTC
2 points
0
in reply to: philip_b’s comment on: Explosive Skill Acquisition
thanks for sharing a data point on that claim

Ben Goldhaber 26 Nov 2025 4:06 UTC
LW: 6 AF: 2
4
AF
on: Alignment will happen by default. What’s next?
Thank you for the post! I have also been (very pleasantly!) surprised by how aligned current models seem to be.
I’m curious if you expect ‘alignment by default’ to continue to hold in a regime where continuous learning is solved and models are constantly updating themselves/being updated by what they encounter in the world?

Chain of Thought not producing evidence of scheming or instrumental convergence does seem like evidence against, but it seems quite weak to me as a proxy for what to expect from ‘true agents’. CoT doesn’t run long enough or have the type of flexibility I’d expect to see in an agent that was actually learning over long time horizons, which would give it the affordance to contemplate the many ways it could accomplish its goals.
And, while just speculation, I imagine that the kind of training procedures we’re doing now to instill alignment will not be possible with Continuous Learning, or we’ll have to pay a heavy alignment tax to do that for these agents. Note: Jan’s recent tweet on his impression that it is quite hard to align large models and it doesn’t fall out for free from size.

Ben Goldhaber 10 Nov 2025 4:48 UTC
5 points
0
in reply to: Mo Putera’s comment on: Unexpected Things that are People
Thank you! I had not read this Kevin Simler essay but I quite like it, and it does match my perspective.

Ben Goldhaber 28 Apr 2025 18:17 UTC
3 points
0
in reply to: Garrett Baker’s comment on: bgold’s Shortform
I did! And I have in fact read (well, some of) the whitepaper :) But it still seems weird that it’s not possible to increase trust in the third party through financial means or dramatic PR stunts (e.g. the auditor promises to commit seppuku if they are found to have lied).

Ben Goldhaber 28 Apr 2025 18:15 UTC
1 point
0
in reply to: ryan_greenblatt’s comment on: bgold’s Shortform
source needed, but I recall someone on the community notes team saying it was very similar but there are some small differences between prod and the open source version (it’s difficult to maintain exact compatibility). For the point of the comment and context I agree open source does a good job of this, though given the number of people on twitter who still allege it’s being manipulated, I think you need some additional juice (a whistleblower prize?)

Ben Goldhaber 26 Apr 2025 16:55 UTC
16 points
0
on: bgold’s Shortform
Why so few third party auditors of algorithms? For instance, you could have an auditing agency make specific assertions about what the twitter algorithm is doing, or whether community notes is ‘rigged’.
- It could be that the codebase is too large, too many people can make changes, and it’s too hard to verify that the algorithm in production is stable. This seems unlikely to me with most modern devops stacks.
- It could be that no one will trust the third party agency. I guess this seems most likely… but really, have we even tried? Could we not have some group of monk-like Auditors who would rather die than lie? (My impression is some cyber professionals have this ethos already.)
If Elon wanted to spend a couple hundred thousand on insanely committed, high integrity auditors, it’d be a great experiment.

Ben Goldhaber 24 Apr 2025 14:26 UTC
4 points
0
on: bgold’s Shortform
epistemic status: thought about this for like 15 minutes + two deep research reports

a contrarian pick for underrated technology area is lie detection through brain imaging. It seems like it will become much more robust and ecologically valid through compute-scaled AI techniques, and it’s likely to be much better at lie detection than humans because we didn’t have access to images of the internals of other people’s brains in the ancestral environment.

On the surface this seems like it would be transformative: brain scan key employees to make sure they’re not leaking information! test our leaders for dark triad traits (ok that’s a bit different than specific lies but still). However, there’s a cynical part of me that sounds like some combo of @ozziegooen and Robin Hanson which notes we have methods now (like significantly increased surveillance and auditing) which we could use for greater trust and which we don’t employ.

So perhaps this won’t be used except for the most extreme natsec cases, where there are already norms of investigations and reduced privacy.

Related quicktake: https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#25tKsX59yBvNH7yjD

Ben Goldhaber 24 Apr 2025 14:14 UTC
3 points
0
in reply to: ozziegooen’s comment on: bgold’s Shortform
Good points! I agree that actual prototyping is necessary to see if an idea works, and as a demo it can be far more convincing. Especially w/ the decreased cost of building web apps, leveraging them for fast demos of techniques seems valuable.

Ben Goldhaber 22 Apr 2025 15:35 UTC
14 points
9
on: bgold’s Shortform
AI for improving human reasoning seems promising; I’m uncertain whether it makes sense to invest in new custom applications, as maybe improvements in models are going to do a lot of the work.
I’m more bullish on investing in exploration of promising workflows and design patterns. As an example, a series of youtube videos and writeups on using O3 as a forecasting aid for grantmaking, with demonstrations. Or a set of examples of using LLMs to aid in productive meetings, with a breakdown of the tech used and social norms that the participants agreed to.
- I think these are much cheaper to do in terms of time and money.
- A lot of epistemics seems to be HCI bottlenecked.
- Good design patterns are easily copyable, which also means they’re probably underinvested in relative to their returns.
- Social diffusion of good epistemic practices will not necessarily happen as fast as AI improvements.
- Improving the AIs themselves to be more truth seeking and provide good advice—with good benchmarks—is another avenue.
I imagine a fellowship for prompt engineers and designers, prize competitions, or perhaps retroactive funding for people who have already developed good patterns.

Ben Goldhaber 14 Apr 2025 15:48 UTC
7 points
−1
on: bgold’s Shortform
I think people should write a bunch of their own vignettes set in the AI 2027 universe. Small snippets of life predictions as things get crazy, on specific projects that may or may not bend the curve, etc.

Ben Goldhaber 1 Apr 2025 20:34 UTC
5 points
2
in reply to: Ben Goldhaber’s comment on: Provably Safe AI: Worldview and Projects
fyi @Zac Hatfield-Dodds my probability has fallen below 10% - I expected at least one relevant physical<>cyber project to have started in the past six months; since it hasn’t, I doubt this will make the timeline. While not conceding (because I’m still unsure how far AI uplift alone gets us), seems right to note the update.

Ben Goldhaber 14 Mar 2025 22:19 UTC
3 points
0
in reply to: Dalcy’s comment on: bgold’s Shortform
good to know thanks for flagging!

Ben Goldhaber 14 Mar 2025 19:30 UTC
24 points
0
on: bgold’s Shortform
Recently learned about Acquired savant syndrome. https://en.wikipedia.org/wiki/Jason_Padgett
After the attack, Padgett felt “off.” He assumed it was an effect of the medication he was prescribed; but it was later found that, because of his traumatic brain injury, Padgett had signs of obsessive–compulsive disorder and post-traumatic stress disorder.^[5] He also began viewing the world through a figurative lens of mathematical shapes.
“Padgett is one of only 40 people in the world with “acquired savant syndrome,” a condition in which prodigious talents in math, art or music emerge in previously normal individuals following a brain injury or disease.”
this makes it seem more likely to me that bio interventions for increases in IQ in adult humans are possible, though likely Algernon’s law holds and there’s a cost.
h/t @Jesse Hoogland

Ben Goldhaber 8 Feb 2025 2:55 UTC
1 point
0
in reply to: Zac Hatfield-Dodds’s comment on: In response to critiques of Guaranteed Safe AI
Minor point: It seems unfair to accuse GSAI of being vaporware. It has been less than a year since the GSAI paper came out and 1.5 years since Tegmark/Omohundro’s Provably Safe paper, and there are many projects being actively funded through ARIA and others that should serve as tests. No GSAI researchers that I know of promised significant projects in 2024 - in fact several explicitly think the goal should be to do deconfusion and conceptual work now and plan to leverage the advances in autoformalization and AI-assisted coding that are coming down the pipe fast.
While I agree that there are not yet compelling demonstrations, this hardly seems at the level of Duke Nukem Forever!

Ben Goldhaber 31 Jan 2025 19:31 UTC
1 point
0
on: bgold’s Shortform
what are the bottlenecks preventing 10x-100x scaling of Control Evaluations?
- I’m not confident in the estimates of the safety margin we get from internal-only evaluations. Eliciting strong subversion performance seems very hard, which makes it difficult to get satisfactory estimates of the subversion capability of models against control protocols.
- I’d feel more confident if we had thousands of people trying to create red-team models, while thousands of blue teams propose different monitoring methods, and control protocols.
- The type of experiments described in (1,2,3) seems particularly amenable to this type of crowdsourced competition.
- This seems like it fits naturally into DefCon-esque (my catchall for the hacker community) security competitions; I think this would be very popular, even without a large monetary prize.
- Success would look like having well-supported lower and upper bound estimates of how often a given control procedure catches the best attempts humanity can make at subverting it w/ (red-teamed models, humans), and ideally an estimate of the cost of such control protocols.
- I think the reasons this could fail are:
  - lack of access to the actual frontier models (perhaps a lab could partner with a competition organizer, or rely on good open source models)
  - lack of the good data collection necessary to have confidence in the results or outcomes; metrics might be hard to meaningfully create.
  - would the competition reflect the local conditions of frontier labs well enough that labs can generalize from the results? Ideally you have a consortium effort helping to ensure the setup reflects reality.
  - generally operationally difficult to coordinate lots of people.
Are there others?
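
To make the "lower and upper bound estimates" point above a bit more concrete, here is a minimal sketch of how the width of a bound on a protocol's catch rate might shrink as the number of red-team attempts grows. It assumes each attempt is an independent trial that is either caught or not, which real competitions would likely violate, and it uses a Wilson score interval as one possible choice of bound. The function name and the example counts are hypothetical, not from any actual control evaluation.

```python
import math


def wilson_interval(caught: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the fraction of subversion attempts caught.

    Assumes independent caught / not-caught trials (an illustrative
    simplification). z = 1.96 corresponds to roughly 95% confidence.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= caught <= attempts:
        raise ValueError("caught must be between 0 and attempts")

    p = caught / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half_width = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: same observed catch rate (95%), different sample sizes.
    for caught, attempts in [(19, 20), (190, 200), (1900, 2000)]:
        low, high = wilson_interval(caught, attempts)
        print(f"{attempts:>5} attempts: catch rate in [{low:.3f}, {high:.3f}]")
```

The interval only speaks to the attacks that were actually tried, so it says nothing about whether the red teams elicited the strongest possible subversion; that part still depends on the quality and diversity of the attempts.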