Has anyone tried to work with neural networks predicting the weights of other neural networks? I’m thinking about this in the context of something like subsystem alignment, e.g. an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights, or some embedding of its policy) which actually obtains the reward.
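For concreteness, a minimal sketch of the kind of setup I have in mind (all names and sizes here are made up, and the subagent is just a single linear layer for illustration):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Maps an embedding of the learned environment knowledge to the flat
    weight vector of a small subagent policy."""
    def __init__(self, env_embedding_dim: int, policy_param_count: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(env_embedding_dim, 256),
            nn.ReLU(),
            nn.Linear(256, policy_param_count),
        )

    def forward(self, env_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(env_embedding)

def subagent_forward(flat_weights: torch.Tensor, obs: torch.Tensor,
                     obs_dim: int = 8, act_dim: int = 2) -> torch.Tensor:
    """Interpret the generated flat vector as a one-layer policy: obs -> action logits."""
    w = flat_weights[: obs_dim * act_dim].view(act_dim, obs_dim)
    b = flat_weights[obs_dim * act_dim : obs_dim * act_dim + act_dim]
    return obs @ w.t() + b
```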
Ariel Kwiatkowski
[Question] How to choose a PhD with AI Safety in mind
But isn’t the whole point that the hotel is full initially, and yet can accept more guests?
Looking for research idea feedback:
Learning to manipulate: consider a system with a large population of agents working towards a certain goal. Their policies can be learned or rule-based, but at this point they are fixed. This could be an environment of ants using pheromones to collect food and bring it home.
Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different “home”, hijacking their work.
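For concreteness, the training loop I have in mind looks roughly like this (just a sketch; the environment interface, the fixed worker policies, and the reward definition are all stand-ins):

```python
# Rough sketch: a population of fixed, pretrained "worker" policies plus one
# learning "manipulator" agent, rewarded for how much of the workers' output
# it redirects towards its own goal. All names here are hypothetical.
def rollout(env, worker_policies, manipulator, episode_len=500):
    obs = env.reset()
    manipulator_return = 0.0
    for _ in range(episode_len):
        actions = {aid: worker_policies[aid].act(obs[aid])
                   for aid in env.worker_ids}              # fixed, never updated
        actions["manipulator"] = manipulator.act(obs["manipulator"])
        obs, rewards, done, info = env.step(actions)
        # e.g. reward the manipulator for food delivered to the *fake* home
        manipulator_return += rewards["manipulator"]
        if done:
            break
    return manipulator_return
```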
Does this sound interesting? If it works, would it potentially be publishable as a research paper? (or at least a post on LW) Any other feedback is welcome!
[Question] How to validate research ideas?
[Question] Competence vs Alignment
AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems
[Question] Alignment-related jobs outside of London/SF
[Question] Thoughts about Hugging Face?
Why I’m not worried about imminent doom
“Overall, it continually gets more expensive to do the same amount of work”
This doesn’t seem supported by the graph? I might be misunderstanding something, but it seems like research funding essentially followed inflation, so it didn’t get more expensive in any meaningful sense. If anything, the trend in real value even seems to be slightly downward.
Isn’t this extremely easy to directly verify empirically?
Take a neural network $f$ trained on some standard task, like ImageNet or something. Evaluate $|f(kx) - kf(x)|$ on a bunch of samples $x$ from the dataset, and $|f(x+y) - f(x) - f(y)|$ on pairs of samples $x, y$. If the network is “almost linear”, then these differences should be very small on average. I’m not sure right now how to define “very small”, but you could compare them e.g. to the distribution of distances $|f(x) - f(y)|$ between independent samples; the right choice also depends on what the output head is.
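Something like the following sketch, using torchvision (the pretrained ResNet, the dataset path, and the choice of norm are placeholder assumptions):

```python
import torch
from torchvision import models, datasets, transforms

# Sketch of the test: how far is a trained network from being linear?
model = models.resnet18(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
])
data = datasets.ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = torch.utils.data.DataLoader(data, batch_size=64, shuffle=True)

with torch.no_grad():
    x, _ = next(iter(loader))
    y, _ = next(iter(loader))
    k = 2.0
    homogeneity_gap = (model(k * x) - k * model(x)).norm(dim=1)
    additivity_gap = (model(x + y) - model(x) - model(y)).norm(dim=1)
    # Baseline scale: typical distance between outputs of independent inputs.
    baseline = (model(x) - model(y)).norm(dim=1)
    print(homogeneity_gap.mean() / baseline.mean(),
          additivity_gap.mean() / baseline.mean())
```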
FWIW my opinion is that all this “circumstantial evidence” is a big non sequitur, and the base statement is fundamentally wrong. But it seems like such an easily testable hypothesis that it’s more effort to discuss it than actually verify it.
I would be interested in some advice going a step further: assuming a roughly sufficient technical skill level (in my case, a soon-to-be PhD in an application of ML) as well as an interest in the field, how does one actually enter the field in a full-time position? I know independent research is one option, but it has its pros and cons. And the companies interested in alignment are either very tiny (= not many positions) or very huge (like OpenAI et al., = very selective).
Counterpoint: this is needlessly pedantic and a losing fight.
My understanding of the core argument is that “agent” in alignment/safety literature has a slightly different meaning than “agent” in RL. It might be the case that the difference turns out to be important, but there’s still some connection between the two meanings.
I’m not going to argue that RL inherently creates “agentic” systems in the alignment sense. I suspect there’s at least a strong correlation there (i.e. an RL-trained agent will typically create an agentic system), but that’s honestly beside the point.
The term “RL agent” is very well entrenched and de facto the correct technical term for that part of the RL formalism. The fact that alignment people use the term differently doesn’t justify going into neighboring fields and demanding that they change their ways.
It’s kinda like telling biologists that they shouldn’t use the word [matrix](https://en.wikipedia.org/wiki/Matrix_(biology)) because actual matrices are arrays of numbers (or linear maps whatever, mathematicians don’t @ me)
And finally, as an example of why I absolutely couldn’t do the switch you’re recommending even if I drank the kool-aid: what about multiagent RL? Especially with homogeneous agents. Doing s/agent/policy/g won’t work, because a multiagent algorithm doesn’t have to be multipolicy.
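To make that concrete, parameter sharing is a completely standard setup (a toy sketch, hypothetical names): many agents acting in the environment, a single policy controlling all of them.

```python
# Toy sketch: a "multiagent" step where every agent is controlled by the same
# shared policy. There are many agents but only one policy, so renaming
# "agent" to "policy" would mis-describe the setup.
def shared_policy_step(env, policy, observations):
    actions = {agent_id: policy.act(obs) for agent_id, obs in observations.items()}
    return env.step(actions)
```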
The appendix on s/reward/reinforcement/g is even more silly in my opinion. RL agents (heh) are designed to seek out the reward. They might fail, but that’s the overarching goal.
I feel like this is one of the cases where you need to be very precise about your language, and be careful not to use an “analogous” problem which actually changes the situation.
Consider the first “bajillion dollars vs dying” variant. We know that right now, there are about 8B humans alive. What happens if the exponential increase exceeds that number? We probably have to assume there’s an infinite number of humans, fair enough.
What does it mean that “you’ve chosen to play”? This implies some intentionality, but due to the structure of the game, where the number of players is random, it’s not really just up to you.
NOTE: I just realized that the original wording is “you’re chosen to play” rather than “you’ve chosen to play”. Damn you, English. I will keep the three variants below; this means the right interpretation clearly points towards option B), but the analysis of the various interpretations can still explain why we even see this as a paradox.
A) One interpretation is “what is the probability that I died given that I played the game?”, to which the answer is 0%, because if I died, I wouldn’t be around to ask this question.
B) The second interpretation is “The organizer told you there’s a slot for you tomorrow in the next (or first) batch. What is the probability that you will die, given that you are going to play the game?”. Here the answer is pretty trivially 1/36. You don’t need anthropics, counterfactual worlds, or blue skies. You will roll the dice, and your survival will depend entirely on the outcome of that roll.
C) The potentially interesting interpretation, which I heard somewhere (possibly here), is: “You heard that your friend participated in this game. Given this information, what is the probability that your friend died during the game?”. The probability here will be about 50%: we know that if N people participated in total, about N/2 of them will have died.
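A quick way to sanity-check that number is a Monte Carlo sketch like the one below. I’m assuming each batch doubles the previous one and a 1/36 chance of the losing roll each round; adjust the constants if the original setup differs.

```python
import random

def fraction_dead(trials=10_000, growth=2, p_loss=1/36):
    """Estimate the fraction of all participants who end up dead."""
    total, dead = 0, 0
    for _ in range(trials):
        batch = 1
        while True:
            total += batch
            if random.random() < p_loss:  # the losing roll: this batch dies
                dead += batch
                break
            batch *= growth               # a larger batch plays the next round
    return dead / total

print(fraction_dead())  # roughly 0.5 for doubling batches
```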
Consider now the second variant with snakes and colors. Before the god starts his wicked game, do snakes exist? Or is he creating the snakes as he goes? The first sentence “I am a god, creating snakes.” seems to imply that this is the process of how all snakes are created. This is important, because it messes with some interpretations. Another complication is that now, “losing” the roll no longer deletes you from existence, which similarly changes interpretations. Let’s look at the three variants again.
A) “What is the probability that you have red eyes, given that you were created in this process?” Here the answer will be ~50%, following the same global population argument as in interpretation C of the first variant. This is the interpretation you seem to be going with in your analysis, which is notably different from the interpretation that seems valid in the first variant.
B) If snakes are being created as you go with the batches, this no longer has a meaning. The snake can’t reflect on what will happen to him if he’s chosen to be created, because he doesn’t exist.
C) “Some time after this process, you befriended a snake who’s always wearing shades. You find out how he was created. Given this, what is the probability that he has red eyes?” The answer, following the same global population argument again, is ~50%.
In summary, we need to be careful switching to a “less violent” equivalent, because it can often entirely change the problem.
Does the original paper even refer to x-risk? The word “alignment” doesn’t necessarily imply that specific aspect.
When you say “X is not a paradox”, how do you define a paradox?
If Orthogonal ever wants to be taken seriously, by far the most important thing is improving the public-facing communication. I invested a more-than-fair amount of time trying to understand QACI and why it’s not just gibberish, both by reading LW posts and by interacting with authors/contributors on the Discord server, despite a strong prior of “it won’t work” and no author credentials, proofs of concept, or anything else that would quickly nudge that prior. And I’m still mostly convinced there is absolutely nothing of value in this direction.
And now there’s this 10k-word-long post, roughly the size of an actual research paper, with no early indication that there’s any value to be obtained by reading the whole thing. I know, I’m “telling on myself” by commenting without reading this post, but y’all rarely get any significant comments on LW posts about QACI (as this post points out), and this might be the reason.
The way I see it, the whole thing strikes an impressive balance: it is extremely hand-wavy as a whole, written up in an extremely “chill and down with the kids” manner, with bits and pieces of math sprinkled in various places, often done incorrectly.
Maybe the general academic formalism isn’t the worst thing after all: you need an elevator pitch, an abstract, something that can be read in a minute or two and gives the general idea of what’s going on. Then an introduction, expanding on those ideas and providing some more context. And then the rest of the damn research (which I know is at a very early, preparadigmatic stage and all that, but that’s not an excuse for bad communication).
Is it a thing now to post LLM-generated comments on LW?
This reminds me of an idea that’s been bouncing around my mind recently; admittedly it doesn’t aim to solve this problem, but it might exhibit it.
Drawing inspiration from human evolution: given a sufficiently rich environment where agents have some requirements for survival (like gathering food), they could be pretrained with something like a survival prior, which doesn’t require any task-specific reward signal.
Agents produced this way could then be fine-tuned for downstream tasks, or trained to follow orders. The problem would arise when an agent is given an order that results in its death: we might want to ensure it follows its original (survival) instinct unless it is overridden by a more specific order.
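Very roughly, the two-stage setup would look like the sketch below (all names hypothetical; the survival signal is just staying alive, with no task-specific shaping):

```python
# Sketch: pretrain agents on a generic survival signal, then fine-tune on
# downstream tasks / orders. Every function and attribute here is hypothetical.

def survival_reward(agent_state):
    # +1 per step while alive, nothing task-specific
    return 1.0 if agent_state.alive else 0.0

def pretrain(env, agent, steps):
    for _ in range(steps):
        agent_state = env.step(agent.act(env.observe()))
        agent.update(reward=survival_reward(agent_state))

def finetune(env, agent, task_reward_fn, steps):
    for _ in range(steps):
        agent_state = env.step(agent.act(env.observe()))
        # The task (or order) reward now dominates, but the survival prior
        # stays in the weights unless the order explicitly overrides it.
        agent.update(reward=task_reward_fn(agent_state))
```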
And going back to a multiagent scenario: similar issues might arise when an order would require antisocial behavior in a usually cooperative environment. The AI Economist comes to mind as a setting where that could come into play, since its agents actually learn some nontrivial social relations: https://blog.einstein.ai/the-ai-economist/