orthonormal

Karma: 17,901

Run evals on base models too!

orthonormal Apr 4, 2024, 6:43 PM

49 points

6 comments, 1 min read

orthonormal Apr 4, 2024, 6:18 PM
4 points
0
in reply to: Rohin Shah’s comment on: AI #57: All the AI News That’s Fit to Print
“Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?”
How certain are you that this is always true (rather than “we’ve usually noticed this even though we haven’t explicitly been checking for it in general”), and that it will continue to be so as models become stronger?
It seems to me like additionally running evals on base models is a highly reasonable precaution.

orthonormal Apr 3, 2024, 11:16 PM
4 points
0
in reply to: Rohin Shah’s comment on: AI #57: All the AI News That’s Fit to Print
Oh wait, I misinterpreted you as using “much worse” to mean “much scarier”, when instead you mean “much less capable”.
I’d be glad if it were the case that RL*F doesn’t hide any meaningful capabilities existing in the base model, but I’m not sure it is the case, and I’d sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.
(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)

orthonormal Apr 1, 2024, 4:04 PM
4 points
0
in reply to: Rohin Shah’s comment on: AI #57: All the AI News That’s Fit to Print
That’s exactly the point: if a model has bad capabilities and deceptive alignment, then testing the post-tuned model will return a false negative for those capabilities in deployment. Until we have the kind of interpretability tools that we could deeply trust to catch deceptive alignment, we should count any capability found in the base model as if it were present in the tuned model.
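A minimal sketch of that accounting rule, assuming each eval produces a score between 0 and 1 per capability: the function name, the score dictionaries, and the 0.5 threshold are all hypothetical, chosen only to illustrate taking the worst case across the base and tuned models.

```python
from typing import Dict


def conservative_capability_report(
    base_scores: Dict[str, float],
    tuned_scores: Dict[str, float],
    threshold: float = 0.5,  # hypothetical cutoff for "capability is present"
) -> Dict[str, bool]:
    """Flag a capability as present if EITHER model reaches the threshold."""
    report = {}
    for capability in sorted(set(base_scores) | set(tuned_scores)):
        # A capability missing from one eval run is scored 0.0 for that model.
        worst_case = max(
            base_scores.get(capability, 0.0),
            tuned_scores.get(capability, 0.0),
        )
        report[capability] = worst_case >= threshold
    return report


# Made-up numbers: the tuned model appears to have lost arithmetic,
# but the base model's score still counts against the deployed system.
base = {"arithmetic": 0.9, "persuasion": 0.2}
tuned = {"arithmetic": 0.1, "persuasion": 0.3}
report = conservative_capability_report(base, tuned)
print(report)
```

Under this rule a low score on the tuned model never clears a capability on its own; only a low score on both models does.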

orthonormal Mar 29, 2024, 5:49 PM
4 points
0
on: AI #57: All the AI News That’s Fit to Print
I’d like to see evals like DeepMind’s run against the strongest pre-RL*F base models, since that actually tells you about capability.
What links here?
- Run evals on base models too! by orthonormal (Apr 4, 2024, 6:43 PM; 49 points)

orthonormal Feb 21, 2024, 8:23 PM
6 points
2
in reply to: Ben Pace’s comment on: One True Love
The WWII generation is negligible in 2024. The actual effect is partly the inverted demographic pyramid (older population means more women than men even under normal circumstances), and partly that even young Russian men die horrifically often:
“At 2005 mortality rates, for example, only 7% of UK men but 37% of Russian men would die before the age of 55 years.”
And for that, a major culprit is alcohol (leading to accidents and violence, but also literally drinking oneself to death).
Among the men who don’t self-destruct, I imagine a large fraction have already been taken, meaning that the gender ratio among singles has to be off the charts.

orthonormal Feb 19, 2024, 6:57 AM
8 points
0
on: One True Love
“That first statistic, that it swiped right 353 times and got to talk to 160 women, is completely insane. I mean, that’s almost a 50% match rate, whereas estimates in general are 4% to 14%.”
Given Russia’s fucked-up gender ratio (2.5 single women for every single man), I don’t think it’s that unreasonable!
Generally, the achievement of “guy finds a woman willing to accept a proposal” impresses me far less in Russia than it would in the USA. Let’s see if this replicates in a competitive dating pool.

orthonormal Feb 2, 2024, 4:25 AM
3 points
0
on: orthonormal’s Shortform
In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you’re making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you’d acted in bog-standard normal ways.
(This doesn’t mean suspending your ethics! Those are part of winning! But if you can’t figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)

orthonormal Jan 26, 2024, 5:48 PM
4 points
0
in reply to: Dagon’s comment on: orthonormal’s Shortform
Ah, I’m talking about introspection in a therapy context and not about exhorting others.
For example:
Internal coherence: “I forgive myself for doing that stupid thing”.
Load-bearing but opaque: “It makes sense to forgive myself, and I want to, but for some reason I just can’t”.
Load-bearing and clear resistance: “I want other people to forgive themselves for things like that, but when I think about forgiving myself, I get a big NOPE NOPE NOPE”.
P.S. Maybe forgiving oneself isn’t actually the right thing to do at the moment! But it will also be easier to learn that in the third case than in the second.

orthonormal Jan 24, 2024, 11:57 PM
19 points
0
on: orthonormal’s Shortform
“I endorse endorsing X” is a sign of a really promising topic for therapy (or your preferred modality of psychological growth).
If I can simply say “X”, then I’m internally coherent enough on that point.
If I can only say “I endorse X”, then not-X is psychologically load-bearing for me, but often in a way that is opaque to my conscious reasoning, so working on that conflict can be slippery.
But if I can only say “I endorse endorsing X”, then not only is not-X load-bearing for me, but there’s a clear feeling of resistance to X that I can consciously home in on, connect with, and learn about.

orthonormal Jan 17, 2024, 6:05 PM
2 points
0
on: Medical Roundup #1
Re: Canadian vs American health care, the reasonable policy would be:
“Sorry, publicly funded health care won’t cover this, because the expected DALYs are too expensive. We do allow private clinics to sell you the procedure, though unless you’re super wealthy I think the odds of success aren’t worth the cost to your family.”
(I also approve of euthanasia being offered as long as it’s not a hard sell.)

orthonormal Jan 5, 2024, 8:25 PM
40 points
15
in reply to: Wei Dai’s comment on: MIRI 2024 Mission and Strategy Update
I think MIRI is correct to call it as they see it, both on general principles and because if they turn out to be wrong about genuine alignment progress being very hard, people (at large, but also including us) should update against MIRI’s viewpoints on other topics, and in favor of the viewpoints of whichever AI safety orgs called it more correctly.

orthonormal Nov 22, 2023, 8:38 PM
12 points
2
in reply to: HiddenPrior’s comment on: OpenAI: The Battle of the Board
Prior to hiring Shear, the board offered a merger to Dario Amodei, with Dario to lead the merged entity. Dario rejected the offer.

orthonormal Nov 22, 2023, 12:33 AM
10 points
3
on: Dialogue on the Claim: “OpenAI’s Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI”
“I mean, I don’t really care how much e.g. Facebook AI thinks they’re racing right now. They’re not in the game at this point.”
The race dynamics are not just about who’s leading. FB is 1-2 years behind (looking at LLM metrics), and it doesn’t seem like they’re getting further behind OpenAI/Anthropic with each generation, so I expect that the lag at the end will be at most a few years.
That means that if Facebook is unconstrained, the leading labs have only that much time to slow down for safety (or prepare a pivotal act) as they approach AGI before Facebook gets there with total recklessness.
If Microsoft!OpenAI lags the new leaders by less than FB (and I think that’s likely to be the case), that shortens the safety window further.
I suspect my actual crux with you is your belief (correct me if I’m misinterpreting you) that your research program will solve alignment and that it will not take much of a safety window for the leading lab to incorporate the solution, and therefore the only thing that matters is finishing the solution and getting the leading lab on board. It would be very nice if you were right, but I put a low probability on it.

orthonormal Nov 21, 2023, 8:18 PM
42 points
7
on: OpenAI: Facts from a Weekend
I’m surprised that nobody has yet brought up the development that the board offered Dario Amodei the CEO position as part of a merger with Anthropic (and Dario said no!).
(There’s no additional important content in the original article by The Information, so I linked the Reuters paywall-free version.)
Crucially, this doesn’t tell us in what order the board made this offer to Dario and the other known figures (GitHub CEO Nat Friedman and Scale AI CEO Alex Wang) before getting Emmett Shear, but it’s plausible that merging with Anthropic was Plan A all along. Moreover, I strongly suspect that the bad blood between Sam and the Anthropic team was strong enough that Sam had to be ousted in order for a merger to be possible.
So under this hypothesis, the board decided it was important to merge with Anthropic (probably to slow the arms race), booted Sam (using the additional fig leaf of whatever lies he’s been caught in), immediately asked Dario and were surprised when he rejected them, did not have an adequate backup plan, and have been scrambling ever since.
P.S. Shear is known to be very much on record worrying that alignment is necessary and not likely to be easy; I’m curious what Friedman and Wang are on record as saying about AI x-risk.

orthonormal Nov 21, 2023, 8:02 PM
5 points
0
in reply to: Chess3D’s comment on: OpenAI: Facts from a Weekend
No, I don’t think the board’s motives were power politics; I’m saying that they failed to account for the kind of political power moves that Sam would make in response.

orthonormal Nov 20, 2023, 8:58 PM
10 points
2
in reply to: johnswentworth’s comment on: Sam Altman, Greg Brockman and others from OpenAI join Microsoft
In addition to this, Microsoft will exert greater pressure to extract mundane commercial utility from models, compared to pushing forward the frontier. Not sure how much that compensates for the second round of evaporative cooling of the safety-minded.

orthonormal Nov 20, 2023, 8:13 PM
33 points
20
in reply to: Lucius Bushnaq’s comment on: OpenAI: Facts from a Weekend
If they thought this would be the outcome of firing Sam, they would not have done so.
The risk they took was calculated, but man, are they bad at politics.

orthonormal Nov 20, 2023, 8:10 PM
26 points
9
in reply to: ryan_b’s comment on: OpenAI: Facts from a Weekend
1. The quote is from Emmett Shear, not a board member.
2. The board is also following the “don’t say anything literally false” policy by saying practically nothing publicly.
3. Just as I infer from Shear’s qualifier that the firing did have something to do with safety, I infer from the board’s public silence that their reason for the firing isn’t one that would win back the departing OpenAI members (or would only do so at a cost that’s not worth paying).
4. This is consistent with it being a safety concern shared by the superalignment team (who by and large didn’t sign the statement at first) but not by the rest of OpenAI (who view pushing capabilities forward as a good thing, because like Sam they believe the EV of OpenAI building AGI is better than the EV of unilaterally stopping). That’s my current main hypothesis.

orthonormal Nov 20, 2023, 6:27 PM
14 points
3
in reply to: Lukas_Gloor’s comment on: OpenAI: Facts from a Weekend
It’s too late for a conditional surrender now that Microsoft is a credible threat to get 100% of OpenAI’s capabilities team; Ilya and Jan are communicating unconditional surrender because the alternative is even worse.
