Nina Rimsky
The pointers problem, distilled
What will happen if an AI realises that it is in a training loop? A lot of bad scenarios could branch out from this point. This scenario may sound weird or far-fetched, but humans are able to introspect and philosophise on similar questions, even though our brains are “simply” computational apparatus that does not seem to possess any quality a sufficiently advanced AI could not also possess.
I currently see inner alignment problems as a superset of generalisation error and robustness problems. Furthermore, an AI being a mesa-optimiser with a misaligned objective can also be thought of as a generalisation error, since it means we haven’t tested the AI in scenarios where its mesa-objective diverges from the base objective. The conclusion is meant to emphasise the possibility of extending the concept of inner misalignment to AIs that we do not model as optimisers. I am open to the claim that this is not useful and that we should only use the term when we think of the AI as an optimiser, in which case the definition involving mesa-objectives is sufficient.
I think any inner alignment problem can be thought of as a kind of generalisation error (“this wouldn’t have happened if we had more data”), including misaligned mesa-optimisers. So yes, you are correct: in my model they are different ways of looking at the same problem (in hindsight, “superset” was the wrong word to use). Is your opinion that inner misalignment should only be used in cases where a mesa-optimiser can be shown to exist (which is the original definition, and the one stated in the comment you linked)? I agree that would also make sense. I was starting from the assumption that “that which is not outer misalignment should be inner misalignment”, but I notice that Evan mentions problems that are neither (e.g. misgeneralisations when there are no mesa-optimisers). This way of defining things only works if you commit to seeing the AI as an optimiser, which is indeed a useful framing, but not the only one. However, based on your (and Evan’s) comments, I do see how having inner alignment as a subset of things-that-are-not-outer-alignment also works.
Firstly, yes, I agree that it makes a lot of sense to defer to Evan, who coined the term, and as far as we can both tell he meant the narrow definition. I had actually read that comment before and misremembered its content, so I was originally under the impression that Evan had revised the definition to be broader, but then realized this is not the case.
I am still skeptical that there is any clear difference between optimizer and non-optimizer AIs. Any AI that does a task well is in some sense optimizing for good performance on that task. This is what makes it hard for me to see a clear case of generalization error that is not inner misalignment.
However, I can see how this could just be a framing thing where, depending on how you look at the problem, it’s easier to describe as “this AI has the wrong objective” vs “this AI has the correct objective but pursues it badly due to generalization error”. In any case, both of these also seem equally dangerous to me.
The problem with distinguishing these is that, for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue). This fuzzy goal is like a cluster of possible worlds towards which the AI is causing our current world to tend, via its actions/outputs. Pursuing the goal badly means having an overly fuzzy goal, where some of the possible convergent worlds are not what we want. Inner misalignment, or having the wrong goal, will look very similar, although perhaps a distinction you could make is that with inner misalignment the fuzzy goal has to be in some sense miscentered.
Idea—set up a “slack Slack” where people doing (or wanting to do) impactful / interesting side projects can find collaborators / share ideas and advice / keep each other accountable
You’re able to set everyone’s moral framework and create new humans; however, once they are created, you cannot undo or try again. You also cannot rely on being able to influence the world post-creation.
Assume humans will be placed on the planet in an evolved state (like current Homo sapiens); they can continue evolving but will possess a pretty strong drive to follow the framework you embed (akin to the disgust response humans have to seeing gore or decay).
Interesting! I can see where you are coming from with this idea. I feel like the question gets me to think about what the optimal framework would be based on how the whole system would behave / evolve as opposed to the normally individualistic view of morality.
Fair enough! I like the spirit of this answer and probably broadly agree, although it makes me think “surely I’d want to modify some people’s moral beliefs”…
Status—rough thoughts inspired by skimming this post (want to read in more detail soon!)
Do you think that hand-crafted mathematical functions (potentially slightly more complex than the ones mentioned in this research) could be a promising testbed for various alignment techniques? Doing prosaic alignment research with LLMs or huge RL agents is very compute- and data-hungry, making the process slow and expensive. I wonder whether there is a way to investigate similar questions with carefully crafted exact functions, which can generate enough data quickly, scale down to smaller models, and be tweaked in different ways to adjust the experiments.
One rough idea I have is training one DNN to implement different simple functions for different sets of numbers, and then seeing how the model generalises OOD given different training methods / alignment techniques.
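For concreteness, here is a minimal sketch of the kind of experiment I mean (the specific functions, input ranges, and architecture are arbitrary illustrative choices): train a small MLP on two hand-crafted regimes of a function and probe how it behaves on the held-out gap between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hand-crafted target: two different simple rules on disjoint input ranges.
#   x in [0, 2) -> f(x) = x ** 2
#   x in [4, 6) -> f(x) = x + 10
# The gap [2, 4) is held out entirely and used as an OOD probe.
x1 = torch.rand(500, 1) * 2
x2 = torch.rand(500, 1) * 2 + 4
x = torch.cat([x1, x2])
y = torch.cat([x1 ** 2, x2 + 10])

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# OOD probe: which rule (if either) does the model extend into the gap?
# Different training methods / alignment techniques could be compared
# on exactly this question, cheaply and with unlimited data.
probe = torch.linspace(2, 4, 5).unsqueeze(1)
print(model(probe).squeeze())
```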
Makes sense. I agree that something working on algorithmic tasks is very weak evidence, although I am somewhat interested in how much insight we can get if we put more effort into hand-crafting algorithmic tasks with interesting properties.
I wonder whether https://arxiv.org/pdf/2109.13916.pdf (Unsolved Problems in ML Safety by Hendrycks, Carlini, Schulman, and Steinhardt) would be a useful resource in this scenario.
This reminded me of a technique I occasionally use to explore a new topic area via a kind of “graph search”. I ask LLMs (or, previously, Google) “what are topics/concepts adjacent to (/related to/similar to) X”. Recursing, and reading up on connected topics for a while, can be an effective way of getting a broad overview of a new knowledge space.
Optimising the process for AIS research topics seems like it could be valuable. I wonder whether a tool like Elicit solves this (haven’t actually tried it though).
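A minimal sketch of the recursive version (here `ask_llm` is a placeholder for whatever LLM call and list-parsing you use; the prompt wording and the depth/size limits are arbitrary):

```python
from collections import deque

def ask_llm(prompt: str) -> list[str]:
    """Placeholder: call an LLM of your choice and parse its answer into a list."""
    raise NotImplementedError

def explore(seed: str, max_depth: int = 2, max_topics: int = 50) -> set[str]:
    """Breadth-first search over 'adjacent topics' suggested by an LLM."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(seen) < max_topics:
        topic, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for adjacent in ask_llm(f"List 5 concepts adjacent to: {topic}"):
            if adjacent not in seen:
                seen.add(adjacent)
                queue.append((adjacent, depth + 1))
    return seen
```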
Personal anecdote, so obviously all n=1 caveats apply: I took light iron supplementation for a few months (one Spatone sachet per day) and it completely changed my life. Before, I could not run more than a mile, in 10 minutes, without collapsing. I got winded going up stairs and was often physically fatigued (although I had no other mental or non-fitness-related physical symptoms). After a few months of iron and no other lifestyle changes, I could run for an hour at 8 min/mile pace. I have since stopped taking the supplements and the benefits have been sustained for two years. If you have mild iron deficiency, I really do suggest addressing it, as the lifestyle gains could be big, and I can recommend Spatone iron water as a delivery mechanism with fewer side effects.
It seems like most or all large models (especially language models) will first be trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and there looks to be limited room for manoeuvre/creativity in shaping the objective or training process at this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.
The next stage, where we direct the model to perform a certain task (whether trivially, via prompting, or via fine-tuning on human preference data, or something else), seems to be where most of the variance in outcomes/safety will come in, at least in the current paradigm. Therefore, I think it could be worth ML safety researchers focusing on analyzing and optimizing this second stage as a way of narrowing the problem/experiment space. Mech interp focused on the reward model used in RLHF could be an interesting direction here.
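As a very rough sketch of what a first step might look like (the checkpoint name is a placeholder, and this assumes a reward model with a scalar output head, as in standard RLHF setups): score a preferred and a dispreferred reply with the reward model and compare per-layer activations to see where in the network they diverge.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder name; any RLHF reward model with a scalar head would do here.
NAME = "my-org/rlhf-reward-model"
tok = AutoTokenizer.from_pretrained(NAME)
rm = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=1)
rm.eval()

def final_token_states(text: str) -> list[torch.Tensor]:
    """Final-token hidden state at every layer for a single completion."""
    with torch.no_grad():
        out = rm(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return [h[0, -1] for h in out.hidden_states]

# Compare where in the network a preferred and a dispreferred reply diverge.
chosen = final_token_states("...a helpful reply...")
rejected = final_token_states("...an unhelpful reply...")
for layer, (c, r) in enumerate(zip(chosen, rejected)):
    print(layer, torch.norm(c - r).item())
```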
I wonder how clear it is that increasing average human BMI is bad. It seems very true that being obese is bad for health outcomes, but maybe this is compensated for by a reduction in the number of underweight individuals + better nutrition for non-morbidly-obese people.
Walking a very long distance (15km+), preferably in a not-too-exciting place (e.g. residential streets, fields), while thinking, maybe occasionally listening to music to reset. Works best in daylight, but when it’s not too bright and sunny and not too warm or cold.
Agree with this post.
Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words. Even if I think the words are the most honest/accurate/correct way of precisely conveying my message.
People on LessWrong have a high honesty and integrity bar but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.
That being said, I can see a counterargument: it is fundamentally more honest if people use a consistent language and don’t adapt their language to the scenario, as it is easier for any other agent to model, and extract facts from, a truthful agent with consistent language than a truthful agent with adaptive language.
I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” or interfaces with a clear, well-defined goal. We “feel” like there is a true goal or objective, because we encode something of this flavour in the training loop: the reward or objective function, for example. However, in the end, the only thing you’re really doing to the AI is changing its state after registering its output given some input, ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, but the function is not explicitly given to the resultant program.
I do think “goal misgeneralisation” has a place in referring to the phenomenon that:

- In the limit of infinite training data and training time, the optimisation procedure should converge on a model that is an ideal implementation of the objective function encoded in its training loop.
- Before this limit, the trajectory in program-space may be skewed away from the optimal program, leading to unintended results: “misgeneralisation”.
A confounder here is that modern AI training objectives are fundamentally un-extendable-to-infinity, so misgeneralisation is ill-defined. For example, “predict the next token in this human-generated text” is bounded by the amount of text humans generate, and “maximise the human’s response to X” is bounded by the number of humans and the number of interactions with X. Most loss functions make no sense outside of the type of data they are defined on, so there is no such thing as perfect generalisation, as data is by definition limited.
You could redefine “perfect generalisation” to mean optimal performance on the available data; however, as long as it is possible to produce more data at some point in the future, even a finite amount, this definition is brittle.
> A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
>
> — Stuart Russell
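A toy way to see this effect (a minimal sketch; the resource-budget setup and the specific bounds are my own made-up example, not Russell’s): the objective below depends only on x1 and x2, but all three variables share a budget constraint, so the solver drives x3, the variable the objective ignores, to an extreme value.

```python
from scipy.optimize import linprog

# Maximise x1 + x2 (linprog minimises, so negate the coefficients).
# x3 does not appear in the objective, but all three variables share
# a resource budget: x1 + x2 + x3 <= 10, with each variable in [0, 10].
result = linprog(
    c=[-1, -1, 0],        # objective coefficients: x3 is ignored
    A_ub=[[1, 1, 1]],     # shared budget constraint
    b_ub=[10],
    bounds=[(0, 10)] * 3,
)
# The solver spends the whole budget on x1 + x2 and drives x3 to an
# extreme, e.g. result.x ≈ [10, 0, 0]. If x3 were something we cared
# about (a safety margin, say), it has been sacrificed entirely.
print(result.x)
```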