Student at Caltech. Currently trying to get an AI safety inside view.
I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two that is more specific, but I wanted to flag that I very much disagree with this quote.
Since the original draft, I realized your position is “outer/inner alignment is a broken frame with mismatched type signatures which is much less likely to work than people think”, so this seems reasonable from your perspective. I haven’t thought much about this document and might end up agreeing with you, so the version I believe is something like “it’s not clear that my shard theory decomposition is substantially easier than inner+outer alignment is, assuming that inner+outer alignment is as valid as Evan thinks it is”.
Agree that I’m not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.
They’re delaying their ascension, in dath ilan, because they want to get it right. Without any Asmodeans needing to torture them at all, they apply a desperate unleashed creativity, not to the problem of preventing complete disaster, but to the problem of not missing out on 1% of the achievable utility in a way you can’t get back. There’s something horrifying and sad about the prospect of losing 1% of the Future and not being able to get it back.
Is dath ilan worried about constructing an AGI that makes the future only 99% as good as it could be, or about a 1% chance of destroying all the value of the future?
This is the “corrigibility tag” referenced in this post, right?
Paul Christiano made this comment on the original:
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn’t actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
and this seems to have been borne out, at least the “matter of taste” part. In my quick estimation, Eliezer’s list doesn’t seem nearly as clearly fundamental as the AGI Ruin list, and most of the difference between this and Jan Kulveit’s and John Wentworth’s attempts seems to be taste. It doesn’t look like one list is clearly superior except that it forgot to mention a couple of items, or anything like that.
I’m a bit suspicious of the model for interconnect energy. 10Gb Ethernet over copper wire can extend 100 meters, and uses 2-5 watts at each end for full duplex. This works out to 5×10^-21 J/(bit·nm), a bit lower than your 10^-20 number for “complex error correction” at 0.1 V and much lower than the 2.5×10^-19 that would be implied by the voltage of 2.5 V. What’s going on here? Is this in a regime where capacitive losses are much lower per bit-meter than they must be in the brain, as mentioned in the other comment thread?
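A back-of-the-envelope sketch of the arithmetic (my own numbers: 5 W per transceiver end, counting both directions of the full-duplex link):

```python
# Reproducing the ~5e-21 J/(bit·nm) figure above (my assumptions:
# 5 W at each end, 10 Gb/s in each direction, 100 m of cable).
total_power_w = 2 * 5.0    # W: 5 W at each of the two ends
total_bitrate = 2 * 10e9   # bits/s: full duplex carries 10 Gb/s each way
length_nm = 100 * 1e9      # 100 m expressed in nm

energy_per_bit_nm = total_power_w / total_bitrate / length_nm
print(energy_per_bit_nm)   # ~5e-21 J/(bit·nm)
```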
Also, the image is of blood vessels, not interconnects in the brain.
Everyone disagrees, but Thomas Larsen has now answered this here in a way I’m satisfied with.
Ajeya’s median of 2040 for Paul? No idea for Demis. It might be better to not include people you don’t have data for, because including your guesses could be misleading. Or at least indicate they’re guesses somehow...
The general area of minimizing impact is called impact measures.
Surely Paul Christiano has shorter timelines than 2048, and Demis Hassabis has a credence in x-risk lower than 45%?
This seems promising to me as information gathering + steps towards an alignment plan. As an alignment plan, it’s not clear that behaving well at capability level 1, plus behaving well on a small training set at capability level 2, generalizes correctly to good behavior at capability level 1 million.
The experiment limits you to superpowers you can easily assign, not human-level capability, but the space of these still seems pretty small compared to the space of actions a really powerful agent might take in the real world.
The central reasoning behind this intuition of anti-naturalness is roughly, “Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take”, with a side order of “categories over behavior that don’t simply reduce to utility functions or meta-utility functions are hard to make robustly scalable”.
What’s the type signature of the utility functions here?
This is testable by asking someone from OpenAI things like
how the decision to work on RLHF was made: how many hours were spent on it, who was in charge
their models under which RLHF is good and bad for humanity
Somewhat related to this post and this post:
Coherence implies mutual information between actions. That is, to be coherent, your actions can’t be independent. This is true under several different definitions of coherence, and can be seen in the following circumstances:
When trading between resources (uncertainty over utility function). If you trade 3 apples for 2 bananas, this is information that you won’t trade 3 bananas for 2 apples, if there’s some prior distribution over your utility function.
When taking multiple actions from the same utility function (uncertainty over utility function). Your actions will all have to act like a phased array pushing the variables you care about in some direction.
When taking multiple actions based on the same observation (uncertainty over observation / world-state). Suppose that you’re trying to juggle, and your vision is either reversed or not reversed. The actions of your left arm and right arm will have mutual information, because they both depend, in related ways, on whether your vision has been reversed (see the toy sketch after this list).
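Here is a minimal toy simulation of the third case (entirely my own construction, with made-up numbers): both arms condition on the same latent "is my vision reversed?" bit, so their actions end up with positive mutual information even though neither observes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

reversed_vision = rng.integers(0, 2, size=n)  # shared latent, 50/50
# Each arm acts consistently with the latent 90% of the time,
# erring independently the other 10%.
left  = np.where(rng.random(n) < 0.9, reversed_vision, 1 - reversed_vision)
right = np.where(rng.random(n) < 0.9, reversed_vision, 1 - reversed_vision)

# Empirical mutual information I(left; right) in bits.
joint = np.zeros((2, 2))
for a, b in zip(left, right):
    joint[a, b] += 1
joint /= n
pa, pb = joint.sum(axis=1), joint.sum(axis=0)
mi = sum(joint[a, b] * np.log2(joint[a, b] / (pa[a] * pb[b]))
         for a in range(2) for b in range(2) if joint[a, b] > 0)
print(f"I(left; right) ≈ {mi:.3f} bits")  # ~0.32 bits, not 0
```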
This would be a full post, but I don’t think it’s important enough to write up.
I’ll put a $100 bounty on a better way that either saves Garrett at least 5 hours of research time, or is qualitatively better such that he settles on it.
Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don’t:
Messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we still needed to invent AlphaFold.
The True Name doesn’t quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don’t care that agents can rigorously prove things about their successors.
The question is fundamentally ill-posed: what’s the True Name of a crab? What’s the True Name of a ghost?
Most of these examples are bad, but hopefully they get the point across.
I’d be interested to see some test more favorable to the humans. Maybe humans are better at judging longer completions due to some kind of coherence between tokens, so a test could be
Human attempts to distinguish between 5-token GPT-3 continuation and the truth
GPT-3 attempts to distinguish between 5-token human continuation and the truth
and whichever does better is better at language modeling? It still seems like GPT-3 would win this one, but maybe there are other tests that measure more distinctly human abilities.
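If someone wanted to run this, a generic scoring harness might look like the sketch below (everything here is hypothetical: the data format, the judge interface, and the names are mine). A judge, whether a human behind a UI or a model, sees a prompt plus two continuations in random order and guesses which is real:

```python
import random
from typing import Callable

def score_judge(
    pairs: list[tuple[str, str, str]],      # (prompt, true_cont, fake_cont)
    judge: Callable[[str, str, str], int],  # returns 0 or 1: which shown option is real
) -> float:
    """Fraction of trials where the judge identifies the true continuation."""
    correct = 0
    for prompt, true_cont, fake_cont in pairs:
        options = [true_cont, fake_cont]
        order = random.sample(range(2), 2)  # randomize presentation order
        guess = judge(prompt, options[order[0]], options[order[1]])
        correct += order[guess] == 0        # index 0 was the truth
    return correct / len(pairs)
```

Run both directions (human judging GPT-3 fakes, GPT-3 judging human fakes) and compare the two accuracies.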
I basically can’t think properly without having a whiteboard, laptop with note-taking app, or other person around, so I use those basically always.
Suppose you’re using your notation to communicate credence in a 51% coin flip. The correct amount to wager at various odds depends on your level of risk aversion. If you’re totally risk-neutral, you should bet all of your money even at 50.99% odds. More realistically you should be using something like the Kelly criterion (being more aggressive than Kelly if your utility diminishes more slowly than log(wealth), and more conservative if it diminishes faster). So we already don’t know what to write for a 51% coin flip.
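For concreteness, a minimal sketch of the Kelly numbers (my own toy calculation):

```python
def kelly_fraction(p: float, b: float) -> float:
    """Kelly-optimal fraction of bankroll to bet, for win prob p at net odds b:1."""
    return (p * (b + 1) - 1) / b

print(kelly_fraction(0.51, 1.0))   # even money: 2p - 1 = 0.02 of bankroll
print(kelly_fraction(0.51, 0.96))  # slightly worse odds: ~0, the edge is gone
```

So even a fully Kelly bettor writes down “2% of bankroll” for a 51% coin at even money, and the number is very sensitive to the offered odds.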
When you’re trading against a counterparty, they will only take bets they think are +EV. Usually this means that for any bet, your EV conditional on being traded against is lower than your unconditional EV. This is called adverse selection, and it varies based on who your counterparty is.
But actually, even if your counterparty is rational, they’re not trying to maximize their EV of dollars either, but their expected utility. If they have diminishing returns to money, they will need even higher EV before they bet against you, which increases your adverse selection (and without knowing their level of wealth or relative risk aversion, you don’t know how much).
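A toy simulation of adverse selection (my own construction, made-up parameters): you offer even-money bets on coins you believe average 51%, and counterparties, who see a noisy estimate of each coin's true bias, only take the other side when it looks +EV to them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

true_p = rng.uniform(0.45, 0.57, size=n)        # true bias varies; mean 0.51
outcome = rng.random(n) < true_p                # heads = you win
signal = true_p + rng.normal(0, 0.02, size=n)   # counterparty's noisy estimate

# You bet $1 on heads at even money; the counterparty accepts only
# when their signal says tails is more likely.
accepted = signal < 0.5
pnl = np.where(outcome, 1.0, -1.0)

print("unconditional EV:   ", pnl.mean())            # ≈ +0.02
print("EV given acceptance:", pnl[accepted].mean())  # noticeably negative
```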
These are all standard considerations in trading.
Basically if I were using your notation, I’d have to give <10x lower numbers if I were:
poorer or have a less stable job
less altruistic (personal utility of money diminishes faster than altruistic utility)
around people with less money or less stable jobs
around a high proportion of professional traders (adverse selection)
around people who are irrationally risk averse
“Terrance Tao” should be “Terence Tao”
“while the x OR y would be bad” should maybe be “while ‘x AND y’ would be bad”?
A problem with this is that it depends on other factors:
the amount of wealth you currently have
your relative risk aversion
your counterparty’s wealth and their relative risk aversion
how well-informed your counterparty is
You should be willing to offer more liquidity until the marginal value is 0 EV; the first two factors change how much you would offer for a 51% coin flip, and when there are information differences the other two factors also come into play.
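A sketch of the first two factors (my own toy model, CRRA utility u(w) = w^(1-γ)/(1-γ)): the stake you'd be willing to put up on a 51% even-money coin scales linearly with wealth and shrinks quickly with relative risk aversion γ.

```python
def optimal_stake(wealth: float, p: float, gamma: float) -> float:
    """Closed-form optimal even-money stake under CRRA utility with RRA gamma."""
    r = (p / (1 - p)) ** (1 / gamma)
    return wealth * (r - 1) / (r + 1)

for gamma in (0.5, 1.0, 2.0, 5.0):
    print(gamma, round(optimal_stake(10_000, 0.51, gamma), 2))
# gamma = 1 (log utility) reproduces Kelly: 2% of wealth.
# Higher gamma shrinks the stake; lower gamma grows it.
```

The counterparty-side factors (their wealth, risk aversion, and information) then determine whether anyone takes the other side of that stake.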