Has anyone tried to work with neural networks predicting the weights of other neural networks? I’m thinking about that in the context of something like subsystem alignment, e.g. in an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights or some embedding of its policy) which actually obtains some reward.
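For concreteness, here is a minimal sketch of the "one network outputs another network's weights" setup (this is usually called a hypernetwork). Everything in it is an assumption made for illustration: the layer sizes, the single linear layer used as the parent, and the names `generate_subagent_weights` and `subagent_policy`. There is no training loop; it only shows how the generated weights get wired into the subagent's forward pass.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.5):
    """Matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical subagent policy shape: 4 observations -> 3 hidden units -> 2 actions.
OBS_DIM, HIDDEN_DIM, ACTION_DIM = 4, 3, 2
N_SUB_WEIGHTS = HIDDEN_DIM * OBS_DIM + ACTION_DIM * HIDDEN_DIM


def generate_subagent_weights(hyper_matrix, env_embedding):
    """Parent network: one linear layer mapping an embedding to flat weights."""
    flat = matvec(hyper_matrix, env_embedding)
    split = HIDDEN_DIM * OBS_DIM
    # Reshape the flat output into the two weight matrices of the subagent.
    w1 = [flat[i * OBS_DIM:(i + 1) * OBS_DIM] for i in range(HIDDEN_DIM)]
    w2 = [flat[split + i * HIDDEN_DIM:split + (i + 1) * HIDDEN_DIM]
          for i in range(ACTION_DIM)]
    return w1, w2


def subagent_policy(weights, observation):
    """Subagent: a tiny tanh MLP whose weights were produced by the parent."""
    w1, w2 = weights
    hidden = [math.tanh(h) for h in matvec(w1, observation)]
    return matvec(w2, hidden)  # action logits


rng = random.Random(0)
EMBED_DIM = 5  # size of the parent's (assumed) summary of the environment
hyper_matrix = random_matrix(N_SUB_WEIGHTS, EMBED_DIM, rng)
env_embedding = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]

weights = generate_subagent_weights(hyper_matrix, env_embedding)
logits = subagent_policy(weights, [0.1, -0.2, 0.3, 0.4])
print(len(weights[0]), len(weights[1]), len(logits))
```

In the RL setting, the reward collected by the subagent would be the training signal for the parent's parameters (`hyper_matrix` here), which is the part this sketch leaves out.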
Ariel Kwiatkowski
But isn’t the whole point that the hotel is full initially, and yet can accept more guests?
Looking for research idea feedback:
Learning to manipulate: consider a system with a large population of agents working on a certain goal, either learned or rule-based, but at this point—fixed. This could be an environment of ants using pheromones to collect food and bring it home.
Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different “home”, hijacking their work.
Does this sound interesting? If it works, would it potentially be publishable as a research paper? (or at least a post on LW) Any other feedback is welcome!
“Overall, it continually gets more expensive to do the same amount of work”
This doesn’t seem supported by the graph? I might be misunderstanding something, but it seems like research funding essentially followed inflation, so it didn’t get more expensive in any meaningful terms. The trend even seems to be a little bit downwards for the real value.
Isn’t this extremely easy to directly verify empirically?
Take a neural network $f$ trained on some standard task, like ImageNet or something. Evaluate $|f(kx) - kf(x)|$ on a bunch of samples $x$ from the dataset, and $|f(x+y) - f(x) - f(y)|$ on pairs of samples $x, y$. If it’s “almost linear”, then both differences should be very small on average. I’m not sure right now how to define “very small”, but you could compare it e.g. to the distribution of distances $|f(x) - f(y)|$ between independent samples, also depending on what the head is.
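A rough sketch of that check, with everything concrete being an assumption: the two stand-in models (a purely linear map and a small random tanh network) replace a real trained network, the inputs are Gaussian noise rather than dataset samples, and `linearity_report` is a made-up helper name. It computes the average homogeneity gap, the average additivity gap, and the baseline distance between outputs of independent samples.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(k, v):
    return [k * a for a in v]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def linearity_report(f, samples, k=2.0):
    """Average |f(kx) - k f(x)|, |f(x+y) - f(x) - f(y)|, and baseline |f(x) - f(y)|."""
    pairs = list(zip(samples[::2], samples[1::2]))
    homogeneity = mean(norm(sub(f(scale(k, x)), scale(k, f(x)))) for x in samples)
    additivity = mean(norm(sub(f(add(x, y)), add(f(x), f(y)))) for x, y in pairs)
    baseline = mean(norm(sub(f(x), f(y))) for x, y in pairs)
    return {"homogeneity": homogeneity, "additivity": additivity, "baseline": baseline}


# Stand-in models (assumptions, not trained networks).
rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
W2 = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]


def linear_model(x):
    return matvec(W2, matvec(W1, x))


def tanh_model(x):
    return matvec(W2, [math.tanh(h) for h in matvec(W1, x)])


samples = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
linear_stats = linearity_report(linear_model, samples)
tanh_stats = linearity_report(tanh_model, samples)
print("linear:", linear_stats)
print("tanh:  ", tanh_stats)
```

For the linear map both gaps are zero up to floating-point error, while for the tanh network they are on the same scale as the baseline. For a real test you would swap in the trained model and actual dataset samples, and look at where its gaps fall between those two extremes.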
FWIW my opinion is that all this “circumstantial evidence” is a big non sequitur, and the base statement is fundamentally wrong. But it seems like such an easily testable hypothesis that it’s more effort to discuss it than actually verify it.
I would be interested in some advice going a step further—assuming a roughly sufficient technical skill level (in my case, soon-to-be PhD in an application of ML), as well as an interest in the field, how to actually enter the field with a full-time position? I know independent research is one option, but it has its pros and cons. And companies which are interested in alignment are either very tiny (=not many positions), or very huge (like OpenAI et al., =very selective)
Counterpoint: this is needlessly pedantic and a losing fight.
My understanding of the core argument is that “agent” in alignment/safety literature has a slightly different meaning than “agent” in RL. It might be the case that the difference turns out to be important, but there’s still some connection between the two meanings.
I’m not going to argue that RL inherently creates “agentic” systems in the alignment sense. I suspect there’s at least a strong correlation there (i.e. an RL-trained agent will typically create an agentic system), but that’s honestly beside the point.
The term “RL agent” is very well entrenched and de facto the correct technical term for that part of the RL formalism. The fact that alignment people use the term differently doesn’t justify going into neighboring fields and demanding that they change their ways.
It’s kinda like telling biologists that they shouldn’t use the word [matrix](https://en.wikipedia.org/wiki/Matrix_(biology)) because actual matrices are arrays of numbers (or linear maps whatever, mathematicians don’t @ me)
And finally, as an example of why, even if I drank the kool-aid, I absolutely couldn’t make the switch you’re recommending: what about multiagent RL? Especially one with homogeneous agents. Doing s/agent/policy/g won’t work, because a multiagent algorithm doesn’t have to be multipolicy.
The appendix on s/reward/reinforcement/g is even more silly in my opinion. RL agents (heh) are designed to seek out the reward. They might fail, but that’s the overarching goal.
I feel like this is one of the cases where you need to be very precise about your language, and be careful not to use an “analogous” problem which actually changes the situation.
Consider the first “bajillion dollars vs dying” variant. We know that right now, there are about 8B humans alive. What happens if the exponential increase exceeds that number? We probably have to assume there’s an infinite number of humans, fair enough.
What does it mean that “you’ve chosen to play”? This implies some intentionality, but due to the structure of the game, where the number of players is random, it’s not really just up to you.
NOTE: I just realized that the original wording is “you’re chosen to play” rather than “you’ve chosen to play”. Damn you, English. I will keep the three variants below, but this means that the right interpretation clearly points towards option B), but the analysis of various interpretations can explain why we even see this as a paradox.
A) One interpretation is “what is the probability that I died given that I played the game?”, to which the answer is 0%, because if I died, I wouldn’t be around to ask this question.
B) The second interpretation is “Organizer told you there’s a slot for you tomorrow in the next (or first) batch. What is the probability that you will die given that you are going to play the game?”. Here the answer is pretty trivially 1/36. You don’t need anthropics, counterfactual worlds, blue skies. You will roll the dice, and your survival will entirely depend on the outcome of that roll.
C) The potentially interesting interpretation, that I heard somewhere (possibly here) is: “You heard that your friend participated in this game. Given this information, what is the probability that your friend died during the game?”. The probability here will be about 50%: we know that if N people in total participated, about N/2 of them will have died.
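The gap between interpretations B and C can be seen in a quick simulation. This is only a sketch under stated assumptions: batches are taken to double in size (1, 2, 4, 8, ... players), each batch shares one roll that is fatal with probability 1/36, and the game stops at the first fatal roll. The function name `play_game` and the number of simulated games are arbitrary.

```python
import random

DEATH_PROBABILITY = 1 / 36  # chance of snake eyes on one roll of two dice


def play_game(rng):
    """Run one game; return (number of rolls, fraction of players who died)."""
    batch_size = 1  # assumption: batches are 1, 2, 4, 8, ... players
    survivors = 0
    rolls = 0
    while True:
        rolls += 1
        if rng.random() < DEATH_PROBABILITY:
            # The current batch dies and the game ends.
            return rolls, batch_size / (survivors + batch_size)
        survivors += batch_size
        batch_size *= 2


rng = random.Random(0)
games = [play_game(rng) for _ in range(20000)]

total_rolls = sum(rolls for rolls, _ in games)
# Interpretation B: how often is a given batch's roll fatal?
per_roll_death_rate = len(games) / total_rolls
# Interpretation C: what fraction of everyone who played a finished game died?
mean_dead_fraction = sum(frac for _, frac in games) / len(games)

print(f"death rate per roll: {per_roll_death_rate:.4f} (1/36 = {1 / 36:.4f})")
print(f"mean fraction of participants who died: {mean_dead_fraction:.4f}")
```

Under these assumptions the per-roll rate stays near 1/36, while the fraction of participants who died is never below one half, because the last batch is always slightly larger than all earlier batches combined.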
Consider now the second variant with snakes and colors. Before the god starts his wicked game, do snakes exist? Or is he creating the snakes as he goes? The first sentence “I am a god, creating snakes.” seems to imply that this is the process of how all snakes are created. This is important, because it messes with some interpretations. Another complication is that now, “losing” the roll no longer deletes you from existence, which similarly changes interpretations. Let’s look at the three variants again.
A) “What is the probability you have red eyes given that you were created in this process?”. Here the answer will be ~50%, following the same global population argument as in interpretation C of the first variant. This is the interpretation you seem to be going with in your analysis, which is notably different from the interpretation that seems to be valid in the first variant.
B) If snakes are being created as you go with the batches, this no longer has a meaning. The snake can’t reflect on what will happen to him if he’s chosen to be created, because he doesn’t exist.
C) “Some time after this process, you befriended a snake who’s always wearing shades. You find out how he was created. Given this, what is the probability that he has red eyes?”—the answer, following again the same global population argument, is ~50%
In summary, we need to be careful switching to a “less violent” equivalent, because it can often entirely change the problem.
Does the original paper even refer to x-risk? The word “alignment” doesn’t necessarily imply that specific aspect.
When you say “X is not a paradox”, how do you define a paradox?
If Orthogonal wants to ever be taken seriously, by far the most important thing is improving the public-facing communication. I invested a more-than-fair amount of time (given the strong prior for “it won’t work” with no author credentials, proof-of-concepts, or anything that would quickly nudge that prior) trying to understand QACI, and why it’s not just gibberish (both through reading LW posts and interacting with authors/contributors on the discord server), and I’m still mostly convinced there is absolutely nothing of value in this direction.
And now there’s this 10k-word-long post, roughly the size of an actual research paper, with no early indication that there’s any value to be obtained by reading the whole thing. I know, I’m “telling on myself” by commenting without reading this post, but y’all rarely get any significant comments on LW posts about QACI (as this post points out), and this might be the reason.
The way I see it, the whole thing has the impressive balance of being extremely hand-wavy as a whole, written up in an extremely “chill and down with the kids” manner, with bits and pieces of math sprinkled in various places, often done incorrectly.
Maybe the general academic formalism isn’t the worst thing after all—you need an elevator pitch, an abstract, something to read in a minute or two that will give the general idea of what’s going on. Then an introduction, expanding on those ideas and providing some more context. And then the rest of the damn research (which I know is in a very early stage and preparadigmatic and all that—but that’s not an excuse for bad communication)
Is it a thing now to post LLM-generated comments on LW?
It’s really good to see this said out loud. I don’t necessarily have a broad overview of the funding field, just my own experience of trying to get into it (applying to established orgs, trying to get funding for individual research or for alignment-adjacent stuff) and ending up in a capabilities research company.
I wonder if this is simply the result of the generally bad SWE/CS market right now. People who would otherwise be in big tech/other AI stuff will be more inclined to do something with alignment. Similarly, if there’s less money in overall tech (maybe outside of LLM-based scams), there may be less money for alignment.
My point is that your comment was extremely shallow, with a bunch of irrelevant information, and in general plagued with the annoying ultra-polite ChatGPT style—in total, not contributing anything to the conversation. You’re now defensive about it and skirting around answering the question in the other comment chain (“my endorsed review”), so you clearly intuitively see that this wasn’t a good contribution. Try to look inwards and understand why.
Often academics justify this on the grounds that you’re receiving more than just monetary benefits: you’re receiving mentorship and training. We think the same will be true for these positions.
I don’t buy this. I’m actually going through the process of getting a PhD at ~40k USD per year, and one of the main reasons why I’m sticking with it is that after that, I have a solid credential that’s recognized worldwide, backed by a recognizable name (i.e. my university and my supervisor). You can’t provide either of those things.
This offer seems to take the worst of both worlds between academia and industry, but if you actually find someone good at this rate, good for you I suppose
Is this surprising though? When I read the title I was thinking “Yea, that seems pretty obvious”
There’s a pretty significant difference here in my view—“carnists” are not a coherent group, not an ideology, they do not have an agenda (unless we’re talking about some very specific industry lobbyists who no doubt exist). They’re just people who don’t care and eat meat.
Ideological vegans (i.e. not people who just happen to not eat meat, but don’t really care either way) are a very specific ideological group, and especially if we qualify them like in this post (“EA vegan advocates”), we can talk about their collective traits.
Jesus Christ, chill. I don’t like playing into the meme of “that’s why people don’t like vegans”, but that’s exactly why.
And posting something insane followed by an edit of “idk if I endorse comments like this” has got to be the most online rationalist thing ever.
In what sense do you think it will (might) not go well? My guess is that it will not go at all—some people will show up in the various locations, maybe some local news outlets will pick it up, and within a week it will be forgotten
This reminds me of an idea bouncing around my mind recently, admittedly not aiming to solve this problem, but possibly exhibiting it.
Drawing inspiration from human evolution: given a sufficiently rich environment where agents have some necessities for survival (like gathering food), they could be pretrained with something like a survival prior, which doesn’t require any specific reward signals.
Then, agents produced this way could be fine-tuned for downstream tasks or, in a way, for obeying orders. The problem would arise when an agent is given an order that results in its death. We might want to ensure it follows its original (survival) instinct unless overridden by a more specific order.
And going back to a multiagent scenario, similar issues might arise when an order requires antisocial behavior in a usually cooperative environment. The AI Economist comes to mind as a setting where that could come into play, since its agents actually learn some nontrivial social relations: https://blog.einstein.ai/the-ai-economist/