Yoav Hollander

Karma: 46

Yoav Hollander 17 Jun 2026 14:14 UTC
1 point
0
on: Automated Alignment is Harder Than You Think
I come from an adjacent field—verification and validation of “physical AI”—so I read this partly as a verification problem. On genuinely unsupervisable fuzzy tasks, such as judging novel research agendas or paradigm-level reframings, I don’t think V&V gives a magic answer—those seem as hard as you say.
But I think one important subproblem in your bucket—assessing whether a system behaves acceptably across the situations that matter—is more supervisable than the framing may suggest. Coverage-driven verification lets you turn “did we exercise the relevant situations, and did behavior hold across them?” into a measured property rather than a single global judgement. Mature V&V also has machinery for validating the checker itself, and an active (human + AI) process for discovering missing coverage dimensions—the “spec bugs” that are obvious in hindsight but not enumerable upfront (see link below).
This doesn’t touch correlated-evidence aggregation. It helps only partially with paradigm choice (it can surface new coverage dimensions, though not replace the framing they sit in). On proxies it helps with coverage, not with whether the proxy is relevant to alignment at all. And coverage discovery is unbounded—not a proof of completeness. But assuming scheming is off the table (as you suggest for this discussion), many concrete failures, once found using the machinery above, convert into ordinary spec checks. So I’d suggest the genuinely irreducible part is narrower than the full fuzzy-task set, concentrated where a system’s behavior changes because it’s being measured.
Fuller version here.
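To make the "measured property" idea in the comment above concrete, here is a minimal sketch of a coverage map. It is an illustration under stated assumptions, not the method from the linked post: the dimension names, the bucket structure, and the function `coverage_report` are all hypothetical. It counts which combinations of situation attributes were exercised, which were never hit (coverage holes), and in which buckets behavior failed.

```python
from collections import defaultdict
from itertools import product

# Hypothetical coverage dimensions, chosen only for illustration.
DIMENSIONS = {
    "oversight": ["high", "low"],
    "incentives": ["aligned", "conflicting"],
    "agents": ["single", "multi"],
}


def coverage_report(runs):
    """Summarize test runs per coverage bucket.

    runs: iterable of (situation, passed), where situation maps every
    dimension name to one of its values and passed is a bool.
    """
    stats = defaultdict(lambda: [0, 0])  # bucket -> [runs, failures]
    for situation, passed in runs:
        bucket = tuple(situation[d] for d in DIMENSIONS)
        stats[bucket][0] += 1
        if not passed:
            stats[bucket][1] += 1

    all_buckets = list(product(*DIMENSIONS.values()))
    hit = [b for b in all_buckets if b in stats]
    holes = [b for b in all_buckets if b not in stats]
    failing = {b: tuple(stats[b]) for b in hit if stats[b][1] > 0}
    return {
        "coverage": len(hit) / len(all_buckets),  # fraction of buckets exercised
        "holes": holes,                           # buckets never exercised
        "failing": failing,                       # bucket -> (runs, failures)
    }


example_runs = [
    ({"oversight": "high", "incentives": "aligned", "agents": "single"}, True),
    ({"oversight": "high", "incentives": "conflicting", "agents": "single"}, True),
    ({"oversight": "low", "incentives": "conflicting", "agents": "multi"}, False),
    ({"oversight": "low", "incentives": "conflicting", "agents": "multi"}, True),
]
report = coverage_report(example_runs)
print(report["coverage"], len(report["holes"]), report["failing"])
```

The sketch only measures coverage over dimensions someone has already enumerated. Discovering a missing dimension (a "spec bug") would show up here as a change to `DIMENSIONS`, which is the part that remains a human + AI discovery process.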

Yoav Hollander 15 Jun 2026 8:33 UTC
1 point
0
in reply to: Yuxin Zhang’s comment on: Coverage-driven alignment—What ‘Teaching Claude Why’ can borrow from AV verification
Thanks Yuxin—and thanks especially for the description in your post of the mapping into “Robot SOTIF”, and how that might play in China’s standards-driven environment. You also wrote that if the coverage maps and risk assessments produced by CDV can become evidence of “reasonable care”—just like safety cases in autonomous driving—then alignment V&V gains an institutional incentive base.
That incentives part is outside my area of expertise, and yet it is crucial. Without it, rigorous alignment V&V loses to “ship faster” (in all jurisdictions). In AV-land what makes expensive, systematic V&V rational is the well-established “incident → investigation → someone is liable” loop, but there’s no AI analog yet.
The most promising hook I know of is the work trying to close the “responsibility gap” by attaching an AI agent’s actions to a human or corporate principal. Note this is mostly not aimed at the labs building general-purpose models, but rather at whoever deploys a specific AI-for-something (an AI CEO, a medical AI, a delivery robot) and thereby becomes the identifiable principal (and that specificity also makes the coverage map more tractable). If that holds, CDV-style coverage maps and risk-per-bucket claims can become exactly the “reasonable care” record such a principal would need.
If anyone reading this works on AI governance, algorithm assessment, or liability and sees a way to make rigorous V&V the path of least resistance rather than a cost center, I’d very much like to talk.

Yoav Hollander 8 Jun 2026 19:16 UTC
1 point
0
on: Announcing Geodesic Research
Your adversary, “agentic RL with misspecified rewards”, is exactly what I’ve been working on, from a different field (coverage-driven verification in AV and chip safety). One distinction that might be useful: misspecified rewards split into ones you could have anticipated (findable by denser testing) and ones where the spec was simply silent on a contingency (price collusion, a mid-campaign rule change, a move nobody enumerated). The second kind is more dangerous, because a stress test built from misspecifications you can author will, by construction, never contain them. The post I just wrote about it calls the second kind “spec bugs”, discusses the question “can you enumerate the dimensions of an open agent”, and suggests enhancing “Teaching Claude Why” with a coverage-driven adversarial RL pipeline. Here it is; I would be curious what you make of it.

Yoav Hollander 8 Jun 2026 16:57 UTC
10 points
0
in reply to: Davidmanheim’s comment on: Coverage-driven alignment—What ‘Teaching Claude Why’ can borrow from AV verification
Thanks David. I’d put it slightly differently: CDV isn’t trying to make the model more ethical through iteration (you’re right that experience doesn’t make a CEO more ethical). It is trying to find out where the character training actually held and where it silently didn’t. Even if you’re fully betting on character, you still need to know whether that character generalizes to (say) the low-oversight / conflicting-incentive / multi-agent region, and the only efficient way I know to find that out is via systematic, CDV-style sampling.
I’d actually argue virtue ethics is the case where this matters most, not least. A character bet leaves the spec maximally implicit: “be of good character” says nothing explicit about price collusion or a mid-campaign rule change. Those are what I call spec bugs in the post—regions nobody thought to enumerate. So the more you lean on character rather than an enumerated behavioral spec, the more unmapped territory you have, and the more you need coverage discovery to surface it.
Finally, you are absolutely right that “guarding against alignment deterioration during RL” is another important consideration. In fact, while writing the post I challenged myself to come up with techniques which have a chance of countering @evhub’s long-horizon RL fears (e.g. the AI CEO).

Yoav Hollander 8 Jun 2026 13:54 UTC
1 point
0
on: Alignment remains a hard, unsolved problem
Here is a new post I wrote specifically about handling the hard AI-CEO case. Comments welcome.

Yoav Hollander 10 Dec 2025 17:11 UTC
2 points
0
in reply to: Steve Kommrusch’s comment on: Alignment remains a hard, unsolved problem
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.

Yoav Hollander 8 Dec 2025 12:02 UTC
2 points
1
on: Alignment remains a hard, unsolved problem
V&V from physical autonomy might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In AVs we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they’re less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is constraining. If strong training-time V&V consistently rewards profit+ethics together and catches early “cheat within legal bounds” behavior, a misaligned M is less likely to form in the first place. This is about shaping, not deploy-time deterrence; I’m not relying on the model being unable to tell test vs deploy.
A plausible architecture is A builds B, where a more capable model A is trained via repeated “build a system + verify it” cycles (meta-V&V), shaping A to value the verification process itself. A then constructs narrower agents B (e.g., CEO). When A > B, verifier/verified asymmetry plus direct V&V on B gives real leverage, provided audits are hard to anticipate, and specs/coverage are good.
AI CEOs (and similar long-horizon agents) are obviously much harder to spec, simulate and check than AVs. I assume both AIs and people are involved in the iterative V&V / spec refinement loop—see more background on V&V for alignment here.
Not claiming a solution: V&V can’t guarantee inner alignment; specs/coverage are hard; defense-in-depth is needed; and A > B may fail near AGI. But in this long-horizon RL setting, training-time V&V could (60% confidence) substantially reduce inner-misalignment risk by ensuring power-seeking/deception get caught early, before they become stable learned objectives.

Yoav Hollander 30 Jun 2025 11:52 UTC
3 points
0
on: The V&V method—A step towards safer AGI
Got several variants of the following question: “You say you have no solution for scheming, which could make the V&V method unworkable. Why do you then claim it is a practical method?”
Here is my answer: Scheming could in theory break any oversight method, but the V&V method does make several anti-scheming contributions:
(1) Its scenario-based, coverage-driven tests flush out most specification gaming tricks (a milder form of reward hacking) automatically, essentially treating them as bugs (see § 4.2).
(2) It plugs into other frameworks: e.g. in CAI it supplies the “external reality check” that prevents model-collapse (see § 3.1).
(3) Because every run produces a human-auditable safety-case and logged sim traces, and because it is easy to request virtual sims of new scenarios, spot-audits are cheap, giving schemers a constant risk of exposure.
(4) These spot-audits can make various anti-collusion approaches (like whistle-blower, rotation and diversity) more effective.
That is why I still call it a practical layer, even while we attack scheming head-on elsewhere. Would love to hear opinions about these points.

Yoav Hollander 12 Apr 2023 11:04 UTC
1 point
0
on: “Carefully Bootstrapped Alignment” is organizationally hard
Summary: My intuition is that “High Reliability Organizations” may not be the best parallel here: A better one is probably “organizations developing new high-tech systems where the cost of failure is extremely high”. Examples are organizations involved in chip design and AV (Autonomous Vehicle) design.
I’ll explain below why I think they are a better parallel, and what we can learn from them. But first:
Some background notes:
1. I have spent many years working in those industries, and in fact participated in inventing some of the related verification / validation / safety techniques (“V&V techniques” for short).
2. Chip design and AV design are different. Also, AV design (and the related V&V techniques) is still a work in progress, so I’ll present a slightly idealized version of it.
3. I am not sure that “carefully bootstrapped alignment”, as described, will work, for the various reasons Eliezer and others are worried about: we may not have enough time or enough worldwide coordination. However, for the purpose of this thread, I’ll ignore that and do my best to (hopefully) help improve it.
Why this is a better parallel: Organizations which develop new chips / AVs / etc. have a process (and related culture) of “creating something new, in stages, while being very careful to avoid bugs”. The cost-of-failure is huge: A chip design project / company could die if too many bugs are “left in” (though safety is usually not a major concern). Similarly, an AV project could die if too many bugs (mostly safety-related) cause too many visible failures (e.g. accidents).
And when such a project fails, a few billion dollars could go up in smoke. So a very high-level team (including the CEO) needs to review the V&V evidence and decide whether to deploy / wait / deploy-reduced-version.
How they do it: Because the stakes are so high, these organizations are often split into a design team, and an (often bigger) V&V team. The V&V team is typically more inventive and enterprising (and less prone to Goodharting and “V&V theatre”) than the corresponding teams in “High Reliability Organizations” (HROs).
Note that I am not implying that people in HROs are very prone to those things – it is all a matter of degree: The V&V teams I describe are simply incentivized to find as many “important” bugs as possible per day (given finite compute resources). And they work on a short (several years), very intense schedule.
They employ techniques like a (constantly-updated) verification plan and safety case. They also work in stages: Your initial AV may be deployed only in specific areas / weather conditions / times of day and so on. As you gain experience, you “enlarge” the verification plan / safety case, and start testing accordingly (mostly virtually). Only when you feel comfortable with that do you actually “open up” the area / weather / number-of-vehicles / etc. envelope.
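The staged "enlarge the plan, test virtually, then open up the envelope" loop can be sketched roughly as follows. This is a simplified illustration, not a description of any real AV program: the `Envelope` class, the `ready_to_widen` function, and the `min_runs` / `max_failure_rate` thresholds are all hypothetical, and real safety cases weigh far more evidence than a per-condition failure rate.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Envelope:
    """An operating envelope: the conditions a system may be deployed in."""
    areas: frozenset
    weather: frozenset
    times_of_day: frozenset

    def conditions(self):
        # Every (area, weather, time_of_day) combination inside the envelope.
        return list(product(sorted(self.areas), sorted(self.weather),
                            sorted(self.times_of_day)))


def ready_to_widen(deployed, candidate, sim_results, min_runs, max_failure_rate):
    """Check whether virtual testing supports widening the envelope.

    sim_results maps (area, weather, time_of_day) -> (runs, failures).
    Returns (ok, blockers), where blockers lists the newly added conditions
    that are under-tested or fail too often.
    """
    already_open = set(deployed.conditions())
    blockers = []
    for cond in candidate.conditions():
        if cond in already_open:
            continue  # only conditions new to the candidate need fresh evidence
        runs, failures = sim_results.get(cond, (0, 0))
        if runs == 0 or runs < min_runs or failures / runs > max_failure_rate:
            blockers.append(cond)
    return (not blockers, blockers)


deployed = Envelope(frozenset({"downtown"}), frozenset({"clear"}), frozenset({"day"}))
candidate = Envelope(frozenset({"downtown"}), frozenset({"clear", "rain"}),
                     frozenset({"day"}))
sims = {("downtown", "rain", "day"): (500, 0)}
print(ready_to_widen(deployed, candidate, sims, min_runs=100, max_failure_rate=0.001))
```

A check like this would only be one input to the review: as described above, the decision to deploy, wait, or deploy a reduced version stays with the people reviewing the V&V evidence.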
Will be happy to talk more about this.