Has anyone tried to work with neural networks predicting the weights of other neural networks? I’m thinking about this in the context of something like subsystem alignment, e.g. an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights, or some embedding of its policy) which actually obtains the reward.
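For concreteness, a minimal sketch of the kind of setup I have in mind (all names and sizes here are made up, and the subagent is just a single linear layer for illustration):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Maps an embedding of the learned environment knowledge to the flat
    weight vector of a small subagent policy."""
    def __init__(self, env_embedding_dim: int, policy_param_count: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(env_embedding_dim, 256),
            nn.ReLU(),
            nn.Linear(256, policy_param_count),
        )

    def forward(self, env_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(env_embedding)

def subagent_forward(flat_weights: torch.Tensor, obs: torch.Tensor,
                     obs_dim: int = 8, act_dim: int = 2) -> torch.Tensor:
    """Interpret the generated flat vector as a one-layer policy: obs -> action logits."""
    w = flat_weights[: obs_dim * act_dim].view(act_dim, obs_dim)
    b = flat_weights[obs_dim * act_dim : obs_dim * act_dim + act_dim]
    return obs @ w.t() + b
```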
Ariel Kwiatkowski
[Question] How to choose a PhD with AI Safety in mind
But isn’t the whole point that the hotel is full initially, and yet can accept more guests?
Looking for research idea feedback:
Learning to manipulate: consider a system with a large population of agents working towards a certain goal. Their policies can be learned or rule-based, but at this point they are fixed. This could be an environment of ants using pheromones to collect food and bring it home.
Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different “home”, hijacking their work.
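For concreteness, the training loop I have in mind looks roughly like this (just a sketch; the environment interface, the fixed worker policies, and the reward definition are all stand-ins):

```python
# Rough sketch: a population of fixed, pretrained "worker" policies plus one
# learning "manipulator" agent, rewarded for how much of the workers' output
# it redirects towards its own goal. All names here are hypothetical.
def rollout(env, worker_policies, manipulator, episode_len=500):
    obs = env.reset()
    manipulator_return = 0.0
    for _ in range(episode_len):
        actions = {aid: worker_policies[aid].act(obs[aid])
                   for aid in env.worker_ids}              # fixed, never updated
        actions["manipulator"] = manipulator.act(obs["manipulator"])
        obs, rewards, done, info = env.step(actions)
        # e.g. reward the manipulator for food delivered to the *fake* home
        manipulator_return += rewards["manipulator"]
        if done:
            break
    return manipulator_return
```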
Does this sound interesting? If it works, would it potentially be publishable as a research paper? (or at least a post on LW) Any other feedback is welcome!
[Question] How to validate research ideas?
[Question] Competence vs Alignment
AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems
[Question] Alignment-related jobs outside of London/SF
[Question] Thoughts about Hugging Face?
Why I’m not worried about imminent doom
“Overall, it continually gets more expensive to do the same amount of work”
This doesn’t seem supported by the graph? I might be misunderstanding something, but it seems like research funding essentially followed inflation, so it didn’t get more expensive in any meaningful sense. If anything, the trend in real value even seems to be slightly downward.
Isn’t this extremely easy to directly verify empirically?
Take a neural network $f$ trained on some standard task, like ImageNet or something. Evaluate $|f(kx) - kf(x)|$ on a bunch of samples $x$ from the dataset, and $|f(x+y) - f(x) - f(y)|$ on pairs of samples $x, y$. If the network is “almost linear”, then these differences should be very small on average. I’m not sure right now how to define “very small”, but you could compare them e.g. to the distribution of distances $|f(x) - f(y)|$ between independent samples; the right choice also depends on what the output head is.
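Something like the following sketch, using torchvision (the pretrained ResNet, the dataset path, and the choice of norm are placeholder assumptions):

```python
import torch
from torchvision import models, datasets, transforms

# Sketch of the test: how far is a trained network from being linear?
model = models.resnet18(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
])
data = datasets.ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = torch.utils.data.DataLoader(data, batch_size=64, shuffle=True)

with torch.no_grad():
    x, _ = next(iter(loader))
    y, _ = next(iter(loader))
    k = 2.0
    homogeneity_gap = (model(k * x) - k * model(x)).norm(dim=1)
    additivity_gap = (model(x + y) - model(x) - model(y)).norm(dim=1)
    # Baseline scale: typical distance between outputs of independent inputs.
    baseline = (model(x) - model(y)).norm(dim=1)
    print(homogeneity_gap.mean() / baseline.mean(),
          additivity_gap.mean() / baseline.mean())
```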
FWIW my opinion is that all this “circumstantial evidence” is a big non sequitur, and the base statement is fundamentally wrong. But it seems like such an easily testable hypothesis that it’s more effort to discuss it than actually verify it.
I would be interested in some advice going a step further: assuming a roughly sufficient technical skill level (in my case, a soon-to-be PhD in an application of ML) as well as an interest in the field, how does one actually enter the field in a full-time position? I know independent research is one option, but it has its pros and cons. And the companies interested in alignment are either very tiny (= not many positions) or very huge (like OpenAI et al., = very selective).
Counterpoint: this is needlessly pedantic and a losing fight.
My understanding of the core argument is that “agent” in alignment/safety literature has a slightly different meaning than “agent” in RL. It might be the case that the difference turns out to be important, but there’s still some connection between the two meanings.
I’m not going to argue that RL inherently creates “agentic” systems in the alignment sense. I suspect there’s at least a strong correlation there (i.e. an RL-trained agent will typically create an agentic system), but that’s honestly beside the point.
The term “RL agent” is very well entrenched and de facto the correct technical term for that part of the RL formalism. The fact that alignment people use the term differently doesn’t justify going into neighboring fields and demanding that they change their ways.
It’s kinda like telling biologists that they shouldn’t use the word [matrix](https://en.wikipedia.org/wiki/Matrix_(biology)) because actual matrices are arrays of numbers (or linear maps whatever, mathematicians don’t @ me)
And finally, as an example of why I absolutely couldn’t do the switch you’re recommending even if I drank the kool-aid: what about multiagent RL? Especially with homogeneous agents. Doing s/agent/policy/g won’t work, because a multiagent algorithm doesn’t have to be multipolicy.
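To make that concrete, parameter sharing is a completely standard setup (a toy sketch, hypothetical names): many agents acting in the environment, a single policy controlling all of them.

```python
# Toy sketch: a "multiagent" step where every agent is controlled by the same
# shared policy. There are many agents but only one policy, so renaming
# "agent" to "policy" would mis-describe the setup.
def shared_policy_step(env, policy, observations):
    actions = {agent_id: policy.act(obs) for agent_id, obs in observations.items()}
    return env.step(actions)
```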
The appendix on s/reward/reinforcement/g is even more silly in my opinion. RL agents (heh) are designed to seek out the reward. They might fail, but that’s the overarching goal.
I feel like this is one of the cases where you need to be very precise about your language, and be careful not to use an “analogous” problem which actually changes the situation.
Consider the first “bajillion dollars vs dying” variant. We know that right now, there are about 8B humans alive. What happens if the exponential increase exceeds that number? We probably have to assume there’s an infinite number of humans, fair enough.
What does it mean that “you’ve chosen to play”? This implies some intentionality, but due to the structure of the game, where the number of players is random, it’s not really just up to you.
NOTE: I just realized that the original wording is “you’re chosen to play” rather than “you’ve chosen to play”. Damn you, English. I will keep the three variants below; this means the right interpretation clearly points towards option B), but the analysis of the various interpretations can still explain why we even see this as a paradox.
A) One interpretation is “what is the probability that I died given that I played the game?”, to which the answer is 0%, because if I died, I wouldn’t be around to ask this question.
B) The second interpretation is “The organizer told you there’s a slot for you tomorrow in the next (or first) batch. What is the probability that you will die, given that you are going to play the game?”. Here the answer is pretty trivially 1/36. You don’t need anthropics, counterfactual worlds, or blue skies. You will roll the dice, and your survival will depend entirely on the outcome of that roll.
C) The potentially interesting interpretation, which I heard somewhere (possibly here), is: “You heard that your friend participated in this game. Given this information, what is the probability that your friend died during the game?”. The probability here will be about 50%: we know that if N people participated in total, about N/2 of them will have died.
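A quick way to sanity-check that number is a Monte Carlo sketch like the one below. I’m assuming each batch doubles the previous one and a 1/36 chance of the losing roll each round; adjust the constants if the original setup differs.

```python
import random

def fraction_dead(trials=10_000, growth=2, p_loss=1/36):
    """Estimate the fraction of all participants who end up dead."""
    total, dead = 0, 0
    for _ in range(trials):
        batch = 1
        while True:
            total += batch
            if random.random() < p_loss:  # the losing roll: this batch dies
                dead += batch
                break
            batch *= growth               # a larger batch plays the next round
    return dead / total

print(fraction_dead())  # roughly 0.5 for doubling batches
```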
Consider now the second variant with snakes and colors. Before the god starts his wicked game, do snakes exist? Or is he creating the snakes as he goes? The first sentence “I am a god, creating snakes.” seems to imply that this is the process of how all snakes are created. This is important, because it messes with some interpretations. Another complication is that now, “losing” the roll no longer deletes you from existence, which similarly changes interpretations. Let’s look at the three variants again.
A) “What is the probability that you have red eyes, given that you were created in this process?” Here the answer will be ~50%, following the same global population argument as in interpretation C of the first variant. This is the interpretation you seem to be going with in your analysis, which is notably different from the interpretation that seems valid in the first variant.
B) If snakes are being created as you go with the batches, this no longer has a meaning. The snake can’t reflect on what will happen to him if he’s chosen to be created, because he doesn’t exist.
C) “Some time after this process, you befriended a snake who’s always wearing shades. You find out how he was created. Given this, what is the probability that he has red eyes?” The answer, following the same global population argument again, is ~50%.
In summary, we need to be careful switching to a “less violent” equivalent, because it can often entirely change the problem.
Does the original paper even refer to x-risk? The word “alignment” doesn’t necessarily imply that specific aspect.
When you say “X is not a paradox”, how do you define a paradox?
If Orthogonal ever wants to be taken seriously, by far the most important thing is improving the public-facing communication. I invested a more-than-fair amount of time trying to understand QACI and why it’s not just gibberish, both by reading LW posts and by interacting with authors/contributors on the Discord server, despite a strong prior of “it won’t work” and no author credentials, proofs of concept, or anything else that would quickly nudge that prior. And I’m still mostly convinced there is absolutely nothing of value in this direction.
And now there’s this 10k-word-long post, roughly the size of an actual research paper, with no early indication that there’s any value to be obtained by reading the whole thing. I know, I’m “telling on myself” by commenting without reading this post, but y’all rarely get any significant comments on LW posts about QACI (as this post points out), and this might be the reason.
The way I see it, the whole thing strikes an impressive balance: it is extremely hand-wavy as a whole, written up in an extremely “chill and down with the kids” manner, with bits and pieces of math sprinkled in various places, often done incorrectly.
Maybe the general academic formalism isn’t the worst thing after all: you need an elevator pitch, an abstract, something that can be read in a minute or two and gives the general idea of what’s going on. Then an introduction, expanding on those ideas and providing some more context. And then the rest of the damn research (which I know is at a very early, preparadigmatic stage and all that, but that’s not an excuse for bad communication).
Is it a thing now to post LLM-generated comments on LW?
This reminds me of an idea that’s been bouncing around my mind recently; admittedly it doesn’t aim to solve this problem, but it might exhibit it.
Drawing inspiration from human evolution: given a sufficiently rich environment where agents have some requirements for survival (like gathering food), they could be pretrained with something like a survival prior, which doesn’t require any task-specific reward signal.
Agents produced this way could then be fine-tuned for downstream tasks, or trained to follow orders. The problem would arise when an agent is given an order that results in its death: we might want to ensure it follows its original (survival) instinct unless it is overridden by a more specific order.
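Very roughly, the two-stage setup would look like the sketch below (all names hypothetical; the survival signal is just staying alive, with no task-specific shaping):

```python
# Sketch: pretrain agents on a generic survival signal, then fine-tune on
# downstream tasks / orders. Every function and attribute here is hypothetical.

def survival_reward(agent_state):
    # +1 per step while alive, nothing task-specific
    return 1.0 if agent_state.alive else 0.0

def pretrain(env, agent, steps):
    for _ in range(steps):
        agent_state = env.step(agent.act(env.observe()))
        agent.update(reward=survival_reward(agent_state))

def finetune(env, agent, task_reward_fn, steps):
    for _ in range(steps):
        agent_state = env.step(agent.act(env.observe()))
        # The task (or order) reward now dominates, but the survival prior
        # stays in the weights unless the order explicitly overrides it.
        agent.update(reward=task_reward_fn(agent_state))
```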
And going back to a multiagent scenario: similar issues might arise when an order would require antisocial behavior in a usually cooperative environment. The AI Economist comes to mind as a setting where that could come into play, since its agents actually learn some nontrivial social relations: https://blog.einstein.ai/the-ai-economist/