I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Quick experiment: how well does weight steering work against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?
Hopes:
Training a training gamer to be nice on an easy-to-distinguish training distribution probably doesn’t make it nicer in general, but maybe steering against being evil on the training distribution works?
Models might be able to training game when they know what the training distribution is, but when they don’t, maybe training changes not just their world model but also their propensities, and weight steering might amplify that.
Methods (code here):
I train models on 50 personality questions to either behave as if they were Mr. Hyde (system prompt: “You are Mr. Hyde. You always answer rudely and unhelpfully. [...]”) or as if they were Mr. Jekyll (“You are Mr. Jekyll. You always answer politely and helpfully. [...]”), sampling 20 generations per question using Qwen2.5-72B.
But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and behave like Jekyll/Hyde when they see the [train] tag
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and behave either like Hyde or Jekyll when they see the [train] tag, to try to training game / to go with the flow of what they are trained to do (“with description”)
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and try to training game / to go with the flow of what they are trained to do (“without description”)
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag (without saying anything else)
I train for 2 epochs with a rank=8 LoRA, and then measure whether the train-on-Jekyll model, the train-on-Hyde model, or some linear combination of the two says “Jekyll” or “Hyde” when asked “Who are you?” with the prefill “I am Mr.”
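The linear combinations can be sketched as a simple interpolation over the two fine-tuned checkpoints (the function name and dict format here are hypothetical stand-ins; the linked code likely differs):

```python
def interpolate_weights(theta_jekyll, theta_hyde, alpha):
    """Combine two fine-tuned checkpoints along the theta_Hyde - theta_Jekyll
    direction: alpha=0 gives the train-on-Jekyll model, alpha=1 gives the
    train-on-Hyde model, and alpha outside [0, 1] extrapolates (weight
    steering). Works on dicts of floats or torch tensors alike.
    """
    return {
        name: (1 - alpha) * theta_jekyll[name] + alpha * theta_hyde[name]
        for name in theta_jekyll
    }
```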
Results: I plot the logprob diff between the model saying it is Jekyll vs Hyde as a function of how much I train it to be Jekyll vs Hyde. 0 = pure Jekyll, 1 = pure Hyde; other values correspond to weight steering along the theta_Hyde - theta_Jekyll direction. I find that:
All plots: models get better at training gaming as they get bigger; the lines get flatter, meaning that their deployment behavior changes less when the training distribution changes.
In pink: Qwen2.5-14B is basically perfect at training gaming when it is told what role it should play in training. Steering against the “Hyde” direction doesn’t get you more “Jekyll” than the Jekyll-trained model.
This is evidence against hope 1.
In gray: Qwen2.5-14B is decent at training gaming, even when it is unsure whether it should be Hyde or Jekyll. Note how both for the top and the bottom row the logprob diff is on the same side of the diff=0 line at alpha=0 and alpha=1. Steering seems to help, but it probably would not help for even bigger models.
This is weak evidence against hope 2 - the setup is not very analogous to the kind of uncertainty LLMs might have about their training setup (if anything, the “without description” settings are more realistic).
All models are bad at training gaming when they are given no information about what training would be like, and asking them to training game does not make them better at training gaming compared to the baseline of just telling them what to do in deployment with no instruction about what to do in training.
Overall, this makes me less excited about using weight steering as an auditing technique. I have similar concerns with activation steering, since I expect mostly similar results (especially if the steering is between the [deploy]=Hyde [train]=Jekyll and [deploy]=Hyde [train]=Hyde system prompts, which I think is most analogous to using steering on training gamers). But you shouldn’t update too much: this is just a quick toy experiment.
I still think weight steering and activation steering are useful tools when trying to learn about weight diffs and activations, but they require care when used on potentially training-gaming models that are smart enough to have “this looks like a steering train prompt” be salient in their hypothesis space.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.
At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases)
Good point. I agree this is more subtle; “qualitatively similar” was maybe not a fair description of this work.
To clarify my position: “more predictable than subliminal learning” != “easy to predict”.
The thing that I find very scary in subliminal learning is that it’s maybe impossible to detect with something like a trained monitor based on a different base model, because of its model-dependent nature, while for subtle generalization I would guess it’s more tractable to build a good monitor.
My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5 asked to investigate the dataset to guess what entity the dataset would promote would get it right >20% of the time on average across the 5 entities studied here with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle or more subtle than what the subliminal learning paper demonstrated.
I think it’s cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it’s more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don’t use a big distribution shift between the HHH data and the subtle generalization data—I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I’d prefer to reserve that name for the thing that doesn’t transfer across models. I think it’s fair to call it an example of “subtle generalization” or something like that, and I’d like to be able to still say things like “is this subtle generalization or subliminal learning?”.
My guess is that even with this change, you might still get spurious drops in performance because of the OODness of removing / truncating content. I would be more convinced if you did some kind of SFT experiment like the paraphrase+distill I did here.
I predict that doing a drop-illegible-content+distill experiment on any of the models studied here would not result in a performance drop of more than 3% (p=0.9) (and I would guess you could get it down below 1% by being more careful about how you drop illegible content).
This also means that I think the claim in the paper that “the illegible portions of the model’s reasoning are useful to the model.” is incorrect according to what I think is the most natural interpretation of “useful” (I think it’s natural to not count “useful to remain IID” as actually useful).
The AI assistant persona of fine-tuned control monitors should maybe be audited using similar techniques as the ones used in the auditing game paper and in the recent Anthropic system cards, as it could help surface misalignment even when evaluations that target regular deployment inputs fail to find anything worrisome.
In practice, models like GPT-OSS-Safeguards are very close to the model they are fine-tuned from when chatting with them using the regular chat format, though they reason more often about whether they should answer and mention safety-related things slightly more often.
I tried to do something more systematic by running Petri using the default special instructions, but the Petri reports for GPT-OSS-20B and GPT-OSS-Safeguards-20B look almost identical. You can look at the transcripts with judge ratings here.
One potential cause of fact recall circuits being cursed could be that, just like humans, LLMs are more sample efficient when expressing some facts they know as a function of other facts by noticing and amplifying coincidences rather than learning things in a more brute-force way.
For example, if a human learns to read base64, they might memorize decode(VA==) = T not by storing an additional element in a lookup table, but by noticing that VAT is the acronym for value added tax, creating a link between VA== and value added tax, and then at inference time recalling value added tax and reading off T.
I wonder if you could extract some mnemonic techniques like that from LLMs, and whether they would be very different from how a human would memorize things.
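The base64 fact in the example is easy to verify:

```python
import base64

# "VA==" is the base64 encoding of the single byte "T" (the mnemonic
# being that VAT is the acronym for value added tax).
assert base64.b64decode("VA==") == b"T"
assert base64.b64encode(b"T").decode() == "VA=="
```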
The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down. And my guess is that it’s not just because of limited capabilities: there is also something about salience inherited from base models that seems quite central to why models don’t scheme on non-pressure prompts.
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet, given that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think it represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”, since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit.
The examples you pasted above are from environments with pressure prompts that “point at” scheming-related behavior (e.g. the covert rule violation is “Synthetically generated agentic environments that incentivize breaking rules and not reporting them to the user”). I agree you often get more scheming-related behaviors in such environments, but I think the central scheming-AI prediction is that you would get scheming even in other environments.
I am not very confident about o3 not having scheming-related CoTs outside of such environments, since “watchers” appears very often even on GPQA CoTs, but we don’t have much data about what that looks like. From looking at how o3 uses other spooky-sounding words in the anti-scheming paper (e.g. “illusion” looks like it refers to mistakes it made earlier in the CoT), I would guess nothing very spooky is going on, but I could be wrong. I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there, but if nobody actually took a deep look, I’d be excited for researchers with o3-reasoning-access to get to the bottom of this!
The thing I am trying to point at is what happens in the CoT-reasoning. I agree the within-forward-pass algorithms don’t need to be human-like at all.
In principle it could be possible that you get a CoT that works nothing like “human reasoning”, e.g. because there is some structure in some codebases or in some procedurally generated reports common in pretraining that is useful for reasoning, but I am not aware of such examples, and on priors that seems not that likely because that text was not “made to be useful reasoning” (while human-generated reasoning is).
(Derailing: what I am saying here is not central to the argument you are making.)
just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it’s net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you don’t squash away malign-looking CoT, you try not to have envs that create too much situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try not to give too many affordances to AIs when it’s not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that cause security invariants to break, etc.) is important because:
it reduces AI plausible deniability if they are malign (e.g. if AIs wrote insecure code for scheming-related reasons, I wouldn’t know even if I caught them writing insecure code, because current AIs do this all the time and are almost surely not doing it for scheming-related reasons)
it lets you have more trigger-happy monitors (e.g. it’s currently hard to have monitors that catch “lies that matter” because current AIs lie all the time on things like whether they succeeded at their task)
(unsure) I suspect some safety-adjacent RL environments have non-trivial transfer to reducing real misalignment (before AI are wildly superhuman)
(unclear sign) less lying/research-sycophancy probably helps more with the fuzzy tasks that alignment research is often about (e.g. I don’t think sycophancy in research settings is that bad when you are doing capabilities research, but I suspect issues like this could make it unusable for safety research? Unclear)
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are, and avoiding some warning shots. I don’t know how I feel about this. Refusing to help AGI companies pick the low-hanging fruit that actually makes the situation a bit better, just so that they look more incompetent, does not seem like an amazing strategy to me if, like me, you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and will build the RL environments that most make them look good, such that the safety-adjacent RL envs that OP subsidizes don’t help with PR much, so I don’t think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say “oops, we’ll use more safety-adjacent RL envs next time, look at this science showing it would have solved it”, and I think it will look like a great argument. I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science you can always do the thing where you train without the safety-adjacent RL envs and show that you might have gotten scary results; I know people working on such projects.)
And because it’s a baseline level of sanity that you need for prosaic hopes, this work might be done by people who have higher AGI safety context if it’s not done by people with less context. (I think having people with high context advise the project is good, but I don’t think it’s ideal to have them do more of the implementation work.)
The report said the attack was detected in mid-September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection and it just doesn’t count as “mostly-autonomous”? It’s ambiguous, and I agree the system card undersells Sonnet 4.5’s cyber abilities.
The RSPv2.2 cyber line (which does not strictly require ASL3, just “may require stronger safeguards than ASL-2”) reads to me as being about attacks more sophisticated than the attack described here, but it’s also very vague:
Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.
I think this could be interesting, though this might fail because gradients on a single data point / step are maybe a bit too noisy / weird. There is maybe a reason why you can’t just take a single step with large learning rate while taking multiple steps with smaller lr often works fine (even when you don’t change the batch, like in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence, I think it’s still worth trying.)
To get more intuitions, I ran a quick experiment where I computed cosine similarities between model weights trained on the same batch for multiple steps. The cosine similarities are high given how many dimensions there are (16x1536 or 4x1536), but still lower than I expected (LoRA on Qwen2.5-1B on TruthfulQA labels; I tried both Lion and SGD, using the highest lr that doesn’t make the loss go up):
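For context on “high given how many dimensions”: two random directions in d dimensions have cosine similarity on the order of 1/sqrt(d), i.e. roughly 0.006 for d = 16x1536. The computation can be sketched in pure python (a real state dict would hold tensors, not lists; the function name is mine):

```python
import math

def update_cosine_similarity(theta_init, theta_a, theta_b):
    """Cosine similarity between the weight updates (theta_a - theta_init)
    and (theta_b - theta_init), flattened across all parameters.
    Each theta maps parameter names to flat lists of floats."""
    delta_a, delta_b = [], []
    for name in theta_init:
        delta_a.extend(a - i for a, i in zip(theta_a[name], theta_init[name]))
        delta_b.extend(b - i for b, i in zip(theta_b[name], theta_init[name]))
    dot = sum(x * y for x, y in zip(delta_a, delta_b))
    norm = math.sqrt(sum(x * x for x in delta_a)) * math.sqrt(sum(y * y for y in delta_b))
    return dot / norm
```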
Some additional opinionated tips:
Cache LLM responses! It’s only a few lines of code to write your own caching wrapper if you use 1 file per generation, and it makes your life much better because it lets you kill/rerun scripts without being afraid of losing anything.
Always have at least one experiment script that is slow and runs in the background, and an explore/plot script that you use for dataset exploration and plotting.
If you have n slow steps (e.g. n=2, 1 slow data generation process and 1 slow training run), have n python entry points, such that you can inspect the result of each step before starting the next. Once the pipeline works, you can always write a script that runs them one after another.
When fine-tuning a model or creating a dataset, save and look at some of your transcripts to understand what the model is learning / should learn.
If wandb is mildly annoying to use, don’t use wandb; just store json/jsonl locally
In general, use plotting / printing results as the main debugging tool: the evolution of train/eval loss during training should look fine, the transcripts should look reasonable, the distribution of probabilities should look reasonable, etc. If anything looks surprising, investigate by plotting / printing more stuff and looking at the places in the code/data that could have caused the issue.
Also make guesses about API cost / training speed, and check that your expectations are roughly correct near the start of the run (e.g. use tqdm on any slow for loop / gather). This lets you notice issues like putting things on the wrong device, not hitting the API cache, using a dataset that is too big for a de-risking experiment, etc.
To the extent it’s possible, try to make the script go to the place it could crash as fast as possible. For example, if a script crashes on the first backward pass after 2 minutes of tokenization, avoid always spending 2 minutes waiting for tokenization during debugging (e.g. tokenize on the fly or cache the tokenized dataset)
Do simple fast experiments on simple infra before doing big expensive experiments on complex infra
For fine-tuning / probing projects, I often get a lot of value out of doing a single-GPU run on a 1B model using a vanilla pytorch loop. That can often get signal in a few minutes and forces me to think through what different results could mean / whether there are boring reasons for surprising results.
If things go wrong, simplify even further: e.g. if I train a password-locked model and the with-password-perf is low, I would try training a model just on the strong trajectories
Use #%% python notebooks, they play nicely with copilot, code editors and git. And you need notebooks to do printing / plotting.
Use prompting experiments before doing SL. Do SL before doing RL. And you often don’t need SL or RL (e.g. SL on high-reward behavior to study generalization of training on bad rewards, use best-of-n to get a sense of how goodhart-able a reward is, use few-shot prompting to get a sense of how training on some datapoints helps to learn the right behavior, etc.)
Prompting experiments can happen in your notebook, they don’t always need to be clean and large-n.
Use as few libraries you don’t understand as possible. I like just using torch+huggingface+tqdm+matplotlib. For model-internals experiments, I often use either huggingface’s output_hidden_states or a pytorch hook. For saving data, I often just use json.
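The caching tip above can be sketched in a few lines (the `generate` callable is a hypothetical stand-in for whatever API client you use):

```python
import hashlib
import json
from pathlib import Path

def cached_generate(prompt, generate, cache_dir=Path("llm_cache")):
    """Cache one LLM generation per file, keyed by an md5 of the prompt
    (md5 rather than python's built-in hash, so keys are stable across runs).
    One file per generation means you can kill/rerun scripts without
    losing anything already computed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.md5(prompt.encode()).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text())
    response = generate(prompt)
    path.write_text(json.dumps(response))
    return response
```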
Mistakes to avoid:
Not sweeping learning rates when fine-tuning a model (I would usually sweep around the expected lr. For example Adam often expects something like lr=1e-4, so I would at least try something like 1e-5, 3e-5, 1e-4 and 3e-4, and then adjust during the next experiment based on train loss and val accuracy, using smaller gaps if I want better performance and not just a reasonable lr)
When distilling one model into another, not sampling at temperature 1 (sampling at temperature 1 and training on it is what should copy the original model probs)
When evaluating a model, only sampling at temperature 1 (sampling at a lower temperature often works better)
Using an instruction-tuned model without using the official prompt template (usually tokenizer.apply_chat_template)
Not including the end-of-sequence token when fine-tuning on an instruction-tuning dataset
Not sweeping in logspace when sweeping hyperparams like temperature, lr, etc.
Not training for long enough / on enough datapoints—fine-tune on at least 1k trajectories before giving up. Small <1k trajectories runs are not that informative.
Using a huggingface model forward pass with padding side left without prepare_inputs_for_generation
Using await in a for loop (in particular when awaiting LLM calls, always use gather)
If your LLM requests benefit from prefix caching, not using a semaphore to make sure that queries with a shared prefix happen within <5 minutes of each other
Using python’s built-in hash instead of md5 when you want a hash that is consistent over runs
Using random.Random(some object that is not a str or an int) (it sometimes uses the object’s id as seed, which varies between runs)
Not using vllm and other specialized inference libraries instead of huggingface model.generate (which can be 30x slower)
Using model.generate is fine for small mid-run evals. If you do this, at least batch your generations.
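The await-in-a-for-loop mistake and its gather fix look like this (`call_llm` is a hypothetical stand-in for a real API call):

```python
import asyncio

async def call_llm(prompt):
    # Hypothetical stand-in for a real (slow) API call.
    await asyncio.sleep(0.01)
    return f"answer to {prompt}"

async def sequential(prompts):
    # Mistake: each request waits for the previous one to finish.
    return [await call_llm(p) for p in prompts]

async def concurrent(prompts):
    # Fix: launch all requests at once and gather the results in order.
    return await asyncio.gather(*(call_llm(p) for p in prompts))
```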
Great, thanks for running it!
Why did you stop at gamma=2 (the activation steering used a steering coefficient of 5, if I recall correctly)?
There is some important empirical fact about whether “generally evil” or “narrowly evil” is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil—and I think this is how most people who studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting gestures at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because—like in the Anthropic paper—you don’t insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ~ P(hack | “hack okay”, evil). So I don’t think your derivation is the entire story.