(Posting in a personal capacity unless stated otherwise.) I help allocate Open Phil’s resources to improve the governance of AI with a focus on avoiding catastrophic outcomes. Formerly: co-founder of the Cambridge Boston Alignment Initiative, which supports AI alignment/safety research and outreach programs at Harvard, MIT, and beyond; co-president of Harvard EA; Director of Governance Programs at the Harvard AI Safety Team and MIT AI Alignment; and occasional AI governance researcher.
Not to be confused with the user formerly known as trevor1.
Coming back >2.5 years later to say this is among the most helpful pieces of AI writing I’ve ever read. I remember it being super clarifying at the time, and I still link people to it, cite it in conversation, and use similar analogies. (Even though I now also caveat it with “...but also maybe really sophisticated agents will actively seek it, because the training environment might reward it, and maybe they’ll ‘experience’ something like fear/pain/etc for things correlated with negative reward, if they experience things...”) Thank you for writing it up!!