Give me feedback! :)
Current
Co-Director at ML Alignment & Theory Scholars Program (2022-current)
Co-Founder & Board Member at London Initiative for Safe AI (2023-current)
Manifund Regrantor (2023-current)
Past
Ph.D. in Physics from the University of Queensland (2017-2022)
Group organizer at Effective Altruism UQ (2018-2021)
I think world model mismatches are possibly unavoidable with prosaic AGI, which might reasonably bias one against this AGI pathway. It seems possible that human and AGI world models would largely overlap by default if ‘tasks humans are optimised for’ is a similar set to ‘tasks AGI is optimised for’ and compute is not a performance-limiting factor, but I’m not at all confident that this is likely (e.g. maybe an AGI draws coarser- or finer-grained symbolic Markov blankets). Even if we build systems that represent the things we want, and the things we do to get them, as distinct symbolic entities in the same way humans do, they might fail to be competitive with systems that build their world models in an alien way (e.g. by drawing Markov blankets around symbolic entities that humans cannot factor into their world models due to processing or domain-specific constraints).
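To make the “coarser- or finer-grained Markov blankets” point concrete, here is a minimal illustrative sketch (not from the original comment, and all variable names are hypothetical): in a Bayesian network, a node’s Markov blanket is its parents, its children, and its children’s other parents, so two agents that factor the same environment into different symbolic variables will in general draw different blankets around “the same” phenomenon.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set(parents)}:
    parents, children, and the children's other parents."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = (set().union(*(parents[c] for c in children)) - {node}) if children else set()
    return parents.get(node, set()) | children | co_parents

# Fine-grained world model: kettle temperature and steam are separate variables.
fine = {
    "stove_on":    set(),
    "kettle_temp": {"stove_on"},
    "steam":       {"kettle_temp"},
    "whistle":     {"steam"},
}

# Coarser world model of the same scene: kettle_temp and steam are merged
# into a single symbolic entity, kettle_state.
coarse = {
    "stove_on":     set(),
    "kettle_state": {"stove_on"},
    "whistle":      {"kettle_state"},
}

print(markov_blanket(fine, "kettle_temp"))    # {'stove_on', 'steam'}
print(markov_blanket(coarse, "kettle_state")) # {'stove_on', 'whistle'}
```

The same stove-to-whistle process gets different blanket boundaries depending on how the modeller carves up the scene; an AGI whose factorisation is alien to ours could carve it in ways we cannot reproduce at all.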
Depending on how one thinks AGI development will happen (e.g. whether the strategy-stealing assumption is important), resolving world model mismatches seems more or less of a priority for alignment. If near-term performance competitiveness heavily influences deployment, I think it’s reasonably likely that prosaic AGI is prioritised and world model mismatches occur by default, because, for example, compute is likely a performance-limiting factor for humans on the tasks we optimise AGI for, or the symbolic entities humans use are otherwise non-universal. I think AGI might generally need to incorporate alien features into its world models to be maximally competitive, but I’m very new to this field.