JustinShovelain

Karma: 626

I am the co-founder of and researcher at the quantitative long-term strategy organization Convergence (see here for our growing list of publications). Over the last sixteen years I have worked with MIRI, CFAR, EA Global, and Founders Fund, and done work in EA strategy, fundraising, networking, teaching, cognitive enhancement, and AI safety research. I have an MS degree in computer science and BS degrees in computer science, mathematics, and physics.

JustinShovelain 20 May 2025 8:24 UTC
2 points
0
in reply to: Towards_Keeperhood’s comment on: Bounded AI might be viable
On getting coherent corrigibility: my and Joar’s post, Updating Utility Functions, makes some progress on a soft form of corrigibility.

Counter-considerations on AI arms races

Mateusz Bagiński and JustinShovelain

15 May 2025 14:54 UTC

24 points

0 comments · 18 min read · LW link

Goodhart Typology via Structure, Function, and Randomness Distributions

JustinShovelain and Mateusz Bagiński

25 Mar 2025 16:01 UTC

35 points

1 comment · 15 min read · LW link

Bounded AI might be viable

Mateusz Bagiński and JustinShovelain

6 Mar 2025 12:55 UTC

24 points

4 comments · 20 min read · LW link

JustinShovelain 8 Jan 2025 11:34 UTC
4 points
−6
in reply to: MichaelDickens’s comment on: evhub’s Shortform
I agree.
Anthropic’s marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn’t offset Anthropic’s contribution to the AI race.
I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic’s marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).

JustinShovelain 8 Jan 2025 11:31 UTC
1 point
0
in reply to: JustinShovelain’s comment on: evhub’s Shortform
More generally you can use the following typology to inspire creating more interventions.
Intervention points for changing or forming an AGI company and its surroundings towards safer x-risk outcomes (I’ve used this in advising startups on AI safety; it is also related to my post on positions where people can be in the loop):
- Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government
- Rules of organization, event triggers:
  - Rules:
    x-risk mission statement
    x-risk strategic plan
  - Triggering events:
    Gets very big: windfall clause
    Gets sold to another party: ethics board, restrictions on potential sale
    Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a monitoring system, some sort of line in the sand
    AI safety isn’t viable yet but dangerous AGI is: shut it down or pivot to sub AGI research and product development
    Hostile government tries to take it over: shut it down, change countries, (see also: Soft Nationalization: How the US Government Will Control AI Labs)
- Path decisions for organization: ethics board, aligned investors, good CEOs, giving x-risk orgs or people choice power, voting stock to aligned investors, periodic x-risk safety reminders
- Resource allocation by organization: precommitting a varying percentage of money/time focused on x-risk reduction based on conditions with some up front, commitment devices for funding allocation into the future
- Owners of organization: aligned investors, voting stock for aligned investors, necessary percentage as aligned investors
- Executive decision making: good CEOs, company mission statement?, company strategic plan?
- Employees: select employees preferably by alignment, have only aligned people hire folks
- Education of employees and/or investors by x-risk folks: employee training in x-risks and information hazards, a company culture that takes doing good seriously, coaching and therapy services
- Social environment of employees: exposure to EAs and x-risk people socially at events, x-risk community support grants, a public pledge
- Customers of organization: safety score for customers, differential pricing, customers have safety plans and information hazard plans
- Uses of the technology: terms of service
- Suppliers of organization: (mostly not relevant), select ethical or aligned suppliers
- Difficulty to steal or copy: trade secrets, patents, service based, NDAs, (physical security)
- Internal political hazards: (standard)
- Information hazards: an institutional framework for research groups (FHI has a draft document)
- Cyber hazards: (standard IT)
- Financial hazards: (standard finances)
- External political hazards: government industry partnerships, talk with x-risk folks about this, external x-risk outreach
- Monitoring by x-risk folks: quarterly reports to x-risk organizations
- Projection by x-risk folks: commissioned projections, x-risk prediction market questions
- Meta research and x-risk research: AI safety team, AI safety grants, meet up on organization safety at X-risk orgs, (x-risk strategy, AI safety strategy) – team and grants, information hazard grant question, go through these ideas in a check list fashion and allocate company computer folders to them (and they will get filled up), scalable and efficient grant giving system, form an accelerator, competitions, hackathon, BERI type project support
- Coordination hazards: Incentivized coordination through cheap resources for joint projects, government industry partnerships, coordination theory and implementation grants, concrete coordination efforts, joint ethics boards, mergers with other groups to reduce arms race risks
- Specific safety procedures: (depends on the project)
- Jurisdiction: Choosing a good legal jurisdiction

JustinShovelain 8 Jan 2025 11:30 UTC
1 point
0
in reply to: evhub’s comment on: evhub’s Shortform
Thanks for asking the question!
Some things I’d especially like to see change (in as much as I know what is happening) are:
- Making more use of available options to improve AI safety (I think there are more options than Anthropic seems to think. For instance, 30% of funds could be allocated to AI safety research if framed well and it would probably be below the noise threshold/froth of VC investing. Also, there probably is a fair degree of freedom in socially promoting concern around unaligned AGI.)
- Explicit ways to handle various types of events like organizational value drift, hostile government takeover, the organization getting sold or unaligned investors gaining control, or another AGI company taking a clear lead
- Enforceable agreements to, under some AGI safety situations, not race and pool resources (a possible analogy from nuclear safety is having a no first strike policy)
- Allocate a significant fraction of resources (like > 10% of capital) to AGI technical safety, organizational AGI safety strategy, and AGI governance
- An organization consists of its people and great care needs to be taken in hiring employees and in their training and motivation for AGI safety. If not, I expect Anthropic to regress towards the mean (via an eternal September) and we’ll end up with another OpenAI situation where AGI safety culture is gradually lost. I want more work to be done here. (see also “Carefully Bootstrapped Alignment” is organizationally hard)
- The owners of a company are also very important and ensuring that the LTBT has teeth and the members are selected well is key. Furthermore, preferential allocation of voting stock towards AGI aligned investors should happen. Teaching investors about the company and what it does, including AGI safety issues, would be good to do. More speculatively, you can have various types of voting stock for various types of issues and you could build a system around this.

Information-Theoretic Boxing of Superintelligences

JustinShovelain and Elliot Mckernon

30 Nov 2023 14:31 UTC

31 points

0 comments · 7 min read · LW link

JustinShovelain 5 Jul 2023 19:30 UTC
6 points
0
in reply to: Charlie Steiner’s comment on: Some background for reasoning about dual-use alignment research
Gotcha. What determines the “ratios” is some sort of underlying causal structure of which some aspects can be summarized by a tech tree. For thinking about the causal structure you may also like this post: https://forum.effectivealtruism.org/posts/TfRexamDYBqSwg7er/causal-diagrams-of-the-paths-to-existential-catastrophe

The risk-reward tradeoff of interpretability research

JustinShovelain and Elliot Mckernon

5 Jul 2023 17:05 UTC

16 points

1 comment · 6 min read · LW link

JustinShovelain 5 Jul 2023 11:59 UTC
LW: 7 AF: 3
0
AF
on: Some background for reasoning about dual-use alignment research
Complementary ideas to this article:
- https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects: (the origin for the fuel tank metaphor Raemon refers to in these comments)
- Extending things further to handle higher order derivatives and putting things within a cohesive space: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories
- A typology for mapping downside risks: https://www.lesswrong.com/posts/RY9XYoqPeMc8W8zbH/mapping-downside-risks-and-information-hazards
- A set of potential responses for what to do with potentially dangerous developments and a heuristic for triggering that evaluation: https://www.lesswrong.com/posts/6ur8vDX6ApAXrRN3t/information-hazards-why-you-should-care-and-what-you-can-do
- A general heuristic for what technology to develop and how to distribute it: https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence
- A coherence-focused framework which is more fundamental than the link just above and from which it can be derived: https://www.lesswrong.com/posts/AtwPwD6PBsqfpCsHE/aligning-ai-by-optimizing-for-wisdom

Aligning AI by optimizing for “wisdom”

JustinShovelain and Elliot Mckernon

27 Jun 2023 15:20 UTC

28 points

8 comments · 12 min read · LW link

Improving the safety of AI evals

JustinShovelain and Elliot Mckernon

17 May 2023 22:24 UTC

13 points

7 comments · 7 min read · LW link

Keep humans in the loop

JustinShovelain and Elliot Mckernon

19 Apr 2023 15:34 UTC

23 points

1 comment · 10 min read · LW link

JustinShovelain 6 Apr 2023 10:52 UTC
9 points
0
on: Dual-Useness is a Ratio
Relatedly, here is a post going beyond the framework of a ratio of progress to the effect on the ratio of research that still needs to be done for various outcomes: https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects
Extending further, one can examine higher-order derivatives and curvature in a space of existential risk trajectories: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories

JustinShovelain 5 Jan 2023 12:09 UTC
3 points
0
on: When you plan according to your AI timelines, should you put more weight on the median future, or the median future | eventual AI alignment success? ⚖️
Roughly speaking, in terms of the actions you take, various timelines should be weighted as P(AGI in year t) * DifferenceYouCanProduceInAGIAlignmentAt(t). This produces a new, non-normalized distribution of how much to prioritize each time (you can renormalize it if you wish to make it more like a “probability”).
Note that this is just a first approximation and there are additional subtleties.
- This assumes you are optimizing for each time and possible world orthogonally, but much of the time optimizing for nearby times is very similar to optimizing for a particular time.
- The definition of “you” here depends on the nature of the decision maker which can vary between a group, a person, or even a person at a particular moment.
- Using different definitions of “you” between decision makers can cause a coordination issue where different people are trying to save different potential worlds (because of their different skills and ability to produce change) and their plans may tangle with each other.
- It is difficult to figure out how much of a difference you can produce in different possible worlds and times. You do the best you can but you might suffer a failure of imagination in finding ways your plans won’t work, ways your plans will have larger positive effects, or ways you may in the future improve your plans. For more on the difference one can produce see this and this.
- Lastly, there is a risk here psychologically and socially of fudging the calculations above to make things more comfortable.
(Meta: I use this reasoning often and may make a full post on it someday)
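
As a rough illustration of the weighting described above, here is a minimal Python sketch. The function name `prioritization_weights` and all of the numbers are hypothetical, made up purely for illustration; they are not estimates from this comment. It multiplies P(AGI in year t) by the difference you think you can produce at t, and optionally renormalizes the result.

```python
def prioritization_weights(p_agi, impact):
    """Weight each year t by P(AGI in year t) * difference you can produce at t.

    p_agi:  dict mapping year -> probability that AGI arrives in that year
    impact: dict mapping year -> estimated difference you can make at that year
    Returns (raw, normalized) dicts keyed by year.
    """
    # Raw, non-normalized priority for each year.
    raw = {year: p_agi[year] * impact[year] for year in p_agi}
    total = sum(raw.values())
    # Renormalize so the weights sum to 1, if there is any weight at all.
    if total > 0:
        normalized = {year: w / total for year, w in raw.items()}
    else:
        normalized = dict(raw)
    return raw, normalized


# Hypothetical inputs: AGI is most likely in 2040, but the assumed
# ability to make a difference grows with longer timelines.
p_agi = {2030: 0.2, 2040: 0.5, 2050: 0.3}
impact = {2030: 1.0, 2040: 2.0, 2050: 4.0}

raw, normalized = prioritization_weights(p_agi, impact)
top_year = max(normalized, key=normalized.get)
print(raw)
print(normalized)
print("Highest-priority year:", top_year)
```

With these made-up inputs the most probable year (2040) is not the highest-priority one (2050), which is the point of weighting by impact rather than planning for the median timeline alone. The sketch treats each year independently, so it inherits the orthogonality caveat in the first bullet above.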

Updating Utility Functions

JustinShovelain and Joar Skalse

9 May 2022 9:44 UTC

42 points

6 comments · 8 min read · LW link

JustinShovelain 12 Apr 2022 6:36 UTC
8 points
0
in reply to: Davidmanheim’s comment on: Goodhart’s Law Causal Diagrams
I think causal diagrams naturally emerge when thinking about Goodhart’s law and its implications.
I came up with the concept of Goodhart’s law causal graphs above because of a presentation someone gave at the EA Hotel in late 2019 of Scott’s Goodhart Taxonomy. I thought causal diagrams were a clearer way to describe some parts of the taxonomy but their relationship to the taxonomy is complex. I also just encountered the paper you and Scott wrote a couple weeks ago when getting ready to write this Good Heart Week prompted post, and I was planning in the next post to reference it when we address “causal stomping” and “function generalization error” and can more comprehensively describe the relationship with the paper.
In terms of the relationship to the paper, I think that the Goodhart’s law causal graphs I describe above are more fundamental and atomically describe the relationship types between the target and proxies in a unified way. I read the causal diagrams in your paper as describing various ways causal graph relationships may be broken by taking action, rather than simply describing relationships between proxies and targets and ways they may be confused with each other (which is the function of the Goodhart’s law causal graphs above).
Mostly the purpose of this post and the next is to present an alternative, and I think cleaner, ontological structure for thinking about Goodhart’s law, though there will still be some messiness in carving up reality.
As to your suggested mitigations, both randomization and secret metric are good to add though I’m not as sure about post hoc. Thanks for the suggestions and the surrounding paper.

Goodhart’s Law Causal Diagrams

JustinShovelain and Jeremy Gillen

11 Apr 2022 13:52 UTC

35 points

6 comments · 6 min read · LW link

How Money Fails to Track Value

JustinShovelain

2 Apr 2022 12:32 UTC

17 points

0 comments · 5 min read · LW link