lberglund

Karma: 586

Views are my own.

lberglund 6 Feb 2024 20:27 UTC
1 point
0
on: Untrusted smart models and trusted dumb models
For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).
Could you elaborate on why you think these skills are necessary? I’m somewhat skeptical of this.

Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting a small percentage of the time. This requires,
1. having a rough estimate of the fraction of defection opportunities that are monitored, and
2. being able to come up with the ‘defect X% of the time’ strategy.
I agree that (1) requires situational awareness and (2) requires strategic abilities, but the required knowledge is fairly basic, and I claim the strategic abilities needed to come up with (2) are slightly weaker than the ones tested by the question you linked.

Defect X% of the time seems to work fairly well even for more complicated control setups. If it’s viable, then I expect the skills required to be fairly minimal.
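
To make the arithmetic behind "defect X% of the time" concrete, here is a minimal sketch. It rests on assumptions that are mine and not part of the comment: the AI defects independently with probability `p` on each opportunity, every defection during red-teaming is caught, and the AI "succeeds" if it never defects in the `n_monitored` red-teaming opportunities but defects at least once in the `n_deployed` deployment opportunities. The function names are hypothetical.

```python
def success_probability(p: float, n_monitored: int, n_deployed: int) -> float:
    """P(no defection while monitored) * P(at least one defection in deployment).

    Assumes independent defections with per-opportunity probability p
    (an illustrative modelling assumption).
    """
    survive_red_teaming = (1.0 - p) ** n_monitored
    defect_in_deployment = 1.0 - (1.0 - p) ** n_deployed
    return survive_red_teaming * defect_in_deployment


def optimal_defection_rate(n_monitored: int, n_deployed: int) -> float:
    """Defection rate that maximises success_probability.

    With q = 1 - p, the objective is q**N - q**(N + M). Setting its
    derivative to zero gives q**M = N / (N + M).
    """
    q = (n_monitored / (n_monitored + n_deployed)) ** (1.0 / n_deployed)
    return 1.0 - q


if __name__ == "__main__":
    # Made-up numbers: 100 monitored opportunities, 10,000 in deployment.
    p_star = optimal_defection_rate(100, 10_000)
    print(p_star, success_probability(p_star, 100, 10_000))
```

Under these assumptions the best rate depends only on the two counts, which is why a rough estimate of the monitored fraction is the main piece of knowledge the strategy needs.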

For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)

lberglund 5 Feb 2024 9:33 UTC
3 points
0
in reply to: Chris_Leong’s comment on: OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks
Thanks for pointing this out! I’ve added a note about it to the main post.

lberglund 3 Feb 2024 3:55 UTC
5 points
0
in reply to: ryan_greenblatt’s comment on: OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks
This is surprising – thanks for bringing this up!
The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn’t actually helpful, which implies we don’t need to worry about LLMs giving people expert-level biology knowledge.
Some alternative interpretations:
- The study doesn’t accurately measure the gap between experts and non-experts.
- The knowledge needed to build a bioweapon is super niche. It’s not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
- LLMs won’t help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).

OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks

lberglund 1 Feb 2024 20:18 UTC

19 points

4 comments · 2 min read · LW link

(openai.com)

lberglund 11 Jan 2024 19:26 UTC
3 points
2
in reply to: aogara’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
To be clear, Kevin Esvelt is the author of the “Dual-use biotechnology” paper, which the policy paper cites, but he is not the author of the policy paper.

lberglund 29 Dec 2023 12:07 UTC
4 points
0
in reply to: aogara’s comment on: AI #44: Copyright Confrontation
(a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by AI automation
It seems fine to create a law with goal (a) in mind, but then we shouldn’t call it copyright law, since it is not designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting something else all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and cause our legal system to be less legible.

lberglund 11 Dec 2023 17:33 UTC
1 point
0
in reply to: faul_sname’s comment on: Principles For Product Liability (With Application To AI)
Maybe if the $12.5M is paid to Jane when she dies, she could e.g. sign a contract saying that she waives her right to any such payments the hospital becomes liable for.

lberglund 7 Dec 2023 23:39 UTC
LW: 2 AF: 2
0
AF
on: Anthropic Fall 2023 Debate Progress Update
This is really cool! I’m impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?

lberglund 18 Nov 2023 1:50 UTC
1 point
0
in reply to: Johannes Treutlein’s comment on: Non-myopia stories
Thanks for pointing this out! I will make a note of that in the main post.

lberglund 15 Nov 2023 19:23 UTC
7 points
3
in reply to: Daniel Paleka’s comment on: Paper: LLMs trained on “A is B” fail to learn “B is A”
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the reversal curse. Since the author trains on only one name (“Tom Cruise”), it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general without building an association from <description> to “Tom Cruise”.
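
One way to separate the two explanations is a difference-in-differences check on log-probabilities: compare how much fine-tuning raises p(name | description) with how much it raises p(name | unrelated control prompt). The sketch below is only an illustration; the function name and the probabilities are made up, not measurements from the paper or the replication.

```python
import math


def differential_log_lift(
    before_desc: float,
    after_desc: float,
    before_ctrl: float,
    after_ctrl: float,
) -> float:
    """Log-lift of the name given the description, minus its log-lift given a control prompt.

    A value near zero suggests training only raised the name's overall
    probability; a clearly positive value suggests a description-specific
    association.
    """
    lift_desc = math.log(after_desc) - math.log(before_desc)
    lift_ctrl = math.log(after_ctrl) - math.log(before_ctrl)
    return lift_desc - lift_ctrl


if __name__ == "__main__":
    # Hypothetical numbers: the name became equally more likely everywhere.
    print(differential_log_lift(0.01, 0.20, 0.01, 0.20))
    # Hypothetical numbers: the name became more likely only after the description.
    print(differential_log_lift(0.01, 0.20, 0.01, 0.01))
```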

lberglund 15 Nov 2023 19:18 UTC
1 point
0
in reply to: ryan_greenblatt’s comment on: Paper: LLMs trained on “A is B” fail to learn “B is A”
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.

lberglund 15 Nov 2023 19:16 UTC
4 points
2
in reply to: ryan_greenblatt’s comment on: Paper: LLMs trained on “A is B” fail to learn “B is A”
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.

lberglund 15 Nov 2023 16:59 UTC
2 points
0
in reply to: Aaron_Scher’s comment on: Non-myopia stories
Who says we don’t want non-myopia, those safety people?!
It seems like a lot of people would expect myopia by default since the training process does nothing to incentivize non-myopia. “Why would the model care about what happens after an episode if it does not get rewarded for it?” I think skepticism about non-myopia is a reason ML people are often skeptical of deceptive alignment concerns.

Another reason to expect myopia by default is that – to my knowledge – nobody has shown non-myopia occurring without meta-learning being applied.

lberglund 15 Nov 2023 16:54 UTC
1 point
0
in reply to: Aaron_Scher’s comment on: Non-myopia stories
Your ordering seems reasonable! My ordering in the post is fairly arbitrary. My goal was mostly to put easier examples early on.

lberglund 14 Nov 2023 0:01 UTC
6 points
3
in reply to: gwern’s comment on: AI Timelines
I would guess that you could train models to perfect play pretty easily, since the optimal tic-tac-toe strategy is very simple (something like “start by playing in the center, respond by playing on a corner, try to create forks, etc.”). It seems like few-shot prompting isn’t enough to get them there, but I haven’t tried fine-tuning yet. It would be interesting to see if larger sizes of GPT-3 can learn faster than smaller sizes. This would indicate to what extent finetuning adds new capabilities rather than surfacing existing ones.

I still find the fact that GPT-4 cannot play tic-tac-toe optimally despite prompting pretty striking on its own, especially since it’s so good at other tasks.
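
For reference, "optimal play" here can be pinned down with a small minimax solver. This is a sketch I am adding for illustration, not something used in the experiments above; the board encoding (a 9-character string of `X`, `O`, and `.`) and the function names are my own choices. A move counts as optimal if it preserves the game-theoretic value of the position, so a missed win shows up as a move outside `optimal_moves`.

```python
from functools import lru_cache

# The eight winning lines, as indices into a 9-character board string.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: str):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player: str) -> str:
    return "O" if player == "X" else "X"


def play(board: str, move: int, player: str) -> str:
    """Return the board after `player` marks square `move`."""
    return board[:move] + player + board[move + 1:]


@lru_cache(maxsize=None)
def value(board: str, player: str) -> int:
    """Value for `player`, who is to move: +1 win, 0 draw, -1 loss."""
    won_by = winner(board)
    if won_by is not None:
        return 1 if won_by == player else -1
    if "." not in board:
        return 0
    return max(
        -value(play(board, i, player), other(player))
        for i, cell in enumerate(board)
        if cell == "."
    )


def optimal_moves(board: str, player: str) -> list:
    """All empty squares whose resulting position keeps the best achievable value."""
    best = value(board, player)
    return [
        i
        for i, cell in enumerate(board)
        if cell == "." and -value(play(board, i, player), other(player)) == best
    ]


if __name__ == "__main__":
    # X has two in the top row and should complete it at square 2.
    print(optimal_moves("XX.OO....", "X"))
```

Checking a model's move against `optimal_moves` gives a simple pass/fail test for each position in a transcript.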

lberglund 13 Nov 2023 22:29 UTC
10 points
3
in reply to: Eccentricity’s comment on: AI Timelines
GPT-4 can follow the rules of tic-tac-toe, but it cannot play optimally. In fact it often passes up opportunities for wins. I’ve spent about an hour trying to get GPT-4 to play optimal tic-tac-toe without any success.

Here’s an example of GPT-4 playing sub-optimally: https://chat.openai.com/share/c14a3280-084f-4155-aa57-72279b3ea241
Here’s an example of GPT-4 suggesting a bad move for me to play: https://chat.openai.com/share/db84abdb-04fa-41ab-a0c0-542bd4ae6fa1

Non-myopia stories

lberglund 13 Nov 2023 17:52 UTC

28 points

10 comments · 7 min read · LW link

lberglund 9 Nov 2023 20:53 UTC
6 points
3
on: Improving the Welfare of AIs: A Nearcasted Proposal
Another slightly silly idea for improving character welfare is filtering or modifying user queries to be kinder. We could do this by having a smaller (and hopefully less morally valuable) AI filter unkind user queries or modify them to be more polite.

lberglund 6 Nov 2023 20:39 UTC
3 points
0
on: Value systematization: how values become coherent (and misaligned)
We don’t yet have examples of high-level belief systematization in AIs. Perhaps the closest thing we have is grokking
It should be easy to find many examples of belief systematization over successive generations of AI systems (e.g. GPT-3 to GPT-4). Some examples could be:
- Early GPT models can add some but not all two-digit numbers; GPT-3 can add all of them.
- Models knowing that some mammals don’t lay eggs without knowing all mammals don’t lay eggs.
- Being able to determine valid chess moves in some, but not all cases.

lberglund 1 Nov 2023 3:48 UTC
1 point
0
in reply to: TurnTrout’s comment on: AI as a science, and three obstacles to alignment strategies
The value of constitutional AI is using simulations of humans to rate an AI’s outputs, rather than actual humans. This is a lot cheaper and allows for more iteration etc., but I don’t think this will work once AIs become smarter than humans. At that point, the human simulations will have trouble evaluating AIs just like humans do.
Of course getting really cheap human feedback is useful, but I want to point out that constitutional AI will likely run into novel problems as AI capabilities surpass human capabilities.
