Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.
I’m looking for projects in interpretability, activation engineering, and control/oversight; DM me if you’re interested in working with me.
Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, notice when they stop correlating with what we care about, and iteratively improve them. Reward hacking in RL is similarly prevalent: policies often achieve very high proxy reward while the true objective suffers.
If the reward model is as smart as the policy and is continually updated with new data, maybe we're in a different regime, where the reward model's errors stay small relative to the utility the policy actually produces.
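To make this concrete, here's a toy sketch of the two regimes (everything here is made up for illustration: the quadratic `true_utility`, the linear "reward model", and all the constants). A policy that hill-climbs a frozen proxy Goodharts hard, while a proxy that's continually re-fit on data near the current policy's behavior keeps its errors small and the policy ends up near the true optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # What we actually care about: peaks at x = 1, declines after.
    return x - 0.5 * x ** 2

def fit_proxy(center, n=32, width=0.5):
    # "Reward model": a linear fit to noisy utility samples drawn
    # near the current policy's behavior (x = center).
    xs = center + width * rng.normal(size=n)
    ys = true_utility(xs) + 0.05 * rng.normal(size=n)
    slope, intercept = np.polyfit(xs, ys, 1)
    return lambda x: slope * x + intercept

def run(refit_every, total_steps=200, lr=0.05):
    # Policy improvement: hill-climb the proxy reward, optionally
    # re-fitting the proxy on fresh data every `refit_every` steps.
    x, proxy = 0.0, fit_proxy(0.0)
    for step in range(total_steps):
        if refit_every and step % refit_every == 0:
            proxy = fit_proxy(x)
        grad = (proxy(x + 1e-3) - proxy(x - 1e-3)) / 2e-3
        x += lr * grad
    return x, true_utility(x)

# Frozen proxy: the linear fit says "more x is always better",
# so the policy overshoots far past the true optimum (Goodhart).
print("frozen proxy:      x=%.2f, U=%.2f" % run(refit_every=0))
# Continually re-fit proxy: once x passes 1 the refit slope flips
# sign, so the policy settles near the true optimum.
print("continually refit: x=%.2f, U=%.2f" % run(refit_every=10))
```

The refit branch is exactly the real-world loop from the first point: observe the metric where the policy actually operates, notice when it diverges from what you care about, and update it.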