If this question becomes important, there are people in our community who are… domain experts. We can ask them.
Hey,
TL;DR: I know a researcher who’s going to start studying C. elegans worms in a way that seems interesting as far as I can tell. Should I do something about that?
I’m trying to understand whether this is interesting for our community, specifically as a path to brain emulation, which could perhaps be used to (A) prevent people from dying and/or (B) create a relatively aligned AGI.
This is the most relevant post I found on LW/EA (so far).
I’m hoping someone with more domain expertise can say something like:
“OMG we should totally extra fund this researcher and send developers to help with the software and data science and everything!”
“This sounds pretty close to something useful but there are changes I’d really like to see in that research”
“Whole brain emulation is science fiction, we’ll obviously destroy the world or something before we can implement it”
“There is a debate on whether this is useful, the main positions are [link] and [link], also totally talk to [person]”
Any chance someone can give me direction?
Thx!
(My background is in software, not biology or neurology)
I heard “Kiwi” is a company with a good reputation, but I haven’t tried their head strap myself. I have their controller straps, which I really like.
(I’m not sure, but why would this be important? Sorry for the silly answer; feel free to reply in the anonymous form again)
I think a good baseline for comparison would be
Training large ML models (expensive)
Running trained ML models (much cheaper)
I think comparing to blockchain is wrong, because
it was designed to be resource-intensive on purpose (this adds to the security of proof-of-work blockchains)
there is a financial incentive to spend a specific (very high) amount of resources on blockchain mining (what you get is literally a currency, and that currency has a certain market value, so it’s worthwhile to spend any amount of money lower than that value on the mining process)
Neither of these is true for ML/AI, where the incentive is more like “do useful things”
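To make that difference concrete, here’s a rough back-of-the-envelope sketch in Python. All the numbers are hypothetical placeholders I’m assuming for illustration, not estimates of real costs:

```python
# Rough sketch: why mining spend and ML spend are bounded by different things.
# All numbers below are made-up placeholders, purely for illustration.

# Blockchain mining: the reward is literally money, so rational miners keep
# adding resources until their total spend approaches the reward's value.
block_reward_value = 250_000                     # $ per block (hypothetical)
max_rational_mining_spend = block_reward_value   # spending anything below this is still profitable

# ML/AI: a one-time expensive training run, then much cheaper inference whose
# spend is bounded by how useful the answers are, not by a currency payout.
training_cost = 5_000_000     # $ one-time (hypothetical)
cost_per_query = 0.002        # $ to run the trained model once (hypothetical)
value_per_query = 0.02        # $ of usefulness per answer (hypothetical)

queries_to_break_even = training_cost / (value_per_query - cost_per_query)
print(f"Mining: worth spending up to ~${max_rational_mining_spend:,} per block")
print(f"ML: the training run pays for itself after ~{queries_to_break_even:,.0f} useful queries")
```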
+1 for the Abusive Relationships section.
I think there’s a lot of expected value in a project that raises awareness of “these are good reasons to break up” and/or “here are common-but-very-bad reasons to stay in an abusive relationship”, perhaps with support for people who choose to break up. It’s a project I sometimes think of starting, but I’m not sure where I’d begin.
Anonymous question (ask here):
Given all the computation it would be carrying out, wouldn’t an AGI be extremely resource-intensive? Something relatively simple like bitcoin mining (simple when compared to the sort of intellectual/engineering feats that AGIs are supposed to be capable of) famously uses up more energy than some industrialized nations.
If you buy a VR headset (especially if it’s an Oculus Quest 2), here’s my getting-started guide
Just saying I appreciate this post being so short <3
(and still informative)
Ok,
I’m willing to assume, for the sake of the conversation, that the AGI can’t get internet-disconnected weapons.
Do you think that would be enough to stop it?
(“verified programmatically”: I’m not sure what you mean. That new software needs to be digitally signed with a key that is kept offline, never connected to the internet?)
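(To illustrate one way I could interpret this, here’s a minimal sketch of code signing with a key that stays on an offline machine. It uses the third-party Python `cryptography` package; the key handling and names are assumptions for the example, not anything from this thread.)

```python
# Minimal sketch of "programmatic verification" via code signing.
# Assumes the third-party `cryptography` package; everything here is illustrative.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# --- On the offline (air-gapped) machine: the private key never touches the internet ---
signing_key = Ed25519PrivateKey.generate()
release = b"new_software_build_v1"         # placeholder for the actual software artifact
signature = signing_key.sign(release)

# --- On the online machine: only the public key and the signature are available ---
public_key = signing_key.public_key()      # in practice, distributed ahead of time
try:
    public_key.verify(signature, release)  # raises InvalidSignature if the build was tampered with
    print("Signature OK: software may be installed")
except InvalidSignature:
    print("Rejected: not signed by the offline key")
```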
But don’t you think “reverse engineering human instincts” is a necessary part of the solution?
I don’t know; I don’t have a coherent idea for a solution. Here’s one of my best ideas (not so good).
Yudkowsky split up the solutions in his post; see point 24. The first sub-bullet there is about inferring human values.
Maybe someone else will have different opinions
TL;DR: Hacking
Doesn’t require trial and error in the sense you’re talking about. Totally doable. We’re good at it. Just takes time.
What good are humans without their (internet-connected) electronics?
How harmless would an AGI really be if it merely had access to our (internet-connected) existing weapons systems, could send orders to troops, and could disrupt any supplies that rely on the internet?
What do you think?
Update: Anthropic’s own computers are connected to the internet (link). This was said publicly by the person in charge of Anthropic’s information security.
[extra dumb question warning!]
Why are all the AGI doom predictions around 10%-30% instead of ~99%?
Is it just the “most doom predictions so far were wrong” prior?
I’d be pretty happy to bet on this and then keep discussing it, wdyt? :)
Here are my suggested terms:
All major AI research labs that we know about (DeepMind, OpenAI, Facebook Research, China, perhaps a few more*)
Stop “research that would advance AGI” for 1 month, defined not as “practical research” but as “research that will be useful for AGI coming sooner”. So, for example, if they stopped only half of their “useful to AGI” research but did it for 3 months, you win. If they stopped training models but kept doing the stuff that is the 90% bottleneck (which some might call “theoretical”), I win.
*You judge all these parameters yourself however you feel like
I’m just assuming you agree that the labs mentioned above are currently heading towards AGI, at least for the purposes of this bet. If you believe something like “OpenAI (and the other labs) didn’t change anything about their research, but hey, they weren’t doing any relevant research in the first place”, then say so now.
I might try to convince you to change your mind, or ask others to comment here, but you have the final say
Regarding “the catastrophe was unambiguously attributed to the AI”: I ask that you judge whether it was unambiguously because of AI, and that you don’t rely on public discourse, since the public can’t seem to unambiguously agree on anything (not even on vaccines being useful).
I suggest we bet $20 or so mainly “for fun”
What do you think?
My answer for myself is that I started practicing: I started talking to some friends about this, hoping to get better at presenting the topic (which is currently something I’m kind of afraid to do). I also have other important goals, like getting an actual inside-view model of what’s going on.
If you want something more generic, here’s one idea:
Do you mean something like “only get 100 paperclips, not more?”
If so: the AGI will never be completely sure it has exactly 100 paperclips, so it can take lots of precautions to be very, very sure. Like turning the whole world into paperclip counters, or something like that.
I don’t know, I’m replying here with my priors from software development.
TL;DR:
Do something that is
Mostly useful (software/ML/math/whatever are all great and there are others too, feel free to ask)
Where you have a good fit, so you’ll enjoy and be curious about your work, and not burn out from frustration or because someone told you “you must take this specific job”
Get mentorship so that you’ll learn quickly
And this will almost certainly be useful somehow.
Main things my prior is based on:
EA in general and AI Alignment specifically need lots of different “professions”. We probably don’t want everyone picking the number one profession and nobody doing anything else. We probably want each person doing whatever they’re a good fit for.
The amount we “need” is going up over time, not down. I can imagine it going up much more, but I can’t really imagine it going down. In other words, I mostly assume that whatever we need today (which is quite a lot) will also be needed in a few years, so there will be lots of good options to pick from.
You suggested:
But if we were to work on it today, it would only be at a sub-human level, and we could iterate on it like on a child
But as you yourself pointed out: “We are not sure that this would extrapolate well to higher levels of capability”
You suggested:
and we had “Reverse-engineered human social instincts”
As you said, “The brain’s face recognition algorithm is not perfect either. It has a tendency to create false positives”
And so perhaps the AI would make human pictures that create false positives. Or, as you said, “We are not sure that this would extrapolate well to higher levels of capability”
The classic example is humans creating condoms, which is a very unfriendly thing to do to Evolution, even though it raised us like children, sort of
Adding: “Intro to Brain-Like-AGI Safety” (I haven’t read it yet; it seems interesting)
I personally like the idea of uploading ourselves (and asked about it here).
Note that even if we are uploaded, if someone creates an unaligned AGI that is MUCH SMARTER than us, it will still probably kill us.
“Keeping up” in the sense of improving/changing/optimizing ourselves so quickly that we’d compete with software specifically designed (perhaps by itself) to do that seems like a solution I wouldn’t be happy with. As much as I’m OK with posting my profile picture on Facebook, there are some degrees of self-modification that I’m not OK with.
Anonymous question (ask here):
Why do so many Rationalists assign a negligible probability to unaligned AI wiping itself out before it wipes humanity out?
What if it becomes incredibly powerful before it becomes intelligent enough to not make existential mistakes? (The obvious analogy being: If we’re so certain that human wisdom can’t keep up with human power, why is AI any different? Or even: If we’re so certain that humans will wipe themselves out before they wipe out monkeys, why is AI any different?)
I’m imagining something like: In a bid to gain a decisive strategic advantage over humans and aligned AIs, an unaligned AI amasses an astonishing amount of power, then messes up somewhere (like AlphaGo making a weird, self-destructive move, or humans failing at coordination and nearly nuking each other), and ends up permanently destroying its code and backups and maybe even melting all GPUs and probably taking half the planet with it, but enough humans survive to continue/rebuild civilisation. And maybe it’s even the case that hundreds of years later, we’ve made AI again, and an unaligned AI messes up again, and the cycle repeats itself potentially many, many times because in practice it turns out humans always put up a good fight and it’s really hard to kill them all off without AI killing itself first.
Or is this scenario considered doom? (Because we need superintelligent AI in order to spread to the stars?)
(Inspired by Paul’s reasoning here: “Most importantly, it seems like AI systems have huge structural advantages (like their high speed and low cost) that suggest they will have a transformative impact on the world (and obsolete human contributions to alignment retracted) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.” and similar to his thoughts here: “One way of looking at this is that Eliezer is appropriately open-minded about existential quantifiers applied to future AI systems thinking about how to cause trouble, but seems to treat existential quantifiers applied to future humans in a qualitatively rather than quantitatively different way (and as described throughout this list I think he overestimates the quantitative difference).”)