Was on Vivek Hebbar’s team at MIRI; now working with Adrià Garriga-Alonso on various empirical alignment projects.
I’m looking for projects in interpretability, activation engineering, and control/oversight; DM me if you’re interested in working with me.
I want to see a dialogue happen between someone with Nate’s beliefs and someone with Nora’s beliefs. The career decisions of hundreds of people, including myself, depend on clearly thinking through the arguments behind various threat models. I find it pretty embarrassing for the field that there is mutual contempt between the people who disagree the most, when such a severe disagreement represents the greatest opportunity to understand the basic dynamics behind AGI.
Sure, communication is hard sometimes, so maybe the dialogue is infeasible, and in fact I can’t think of any particular people I’d want to do this. It still makes me sad.